2025-05-07T20:22:35.2900281Z Current runner version: '2.323.0'
2025-05-07T20:22:35.2907631Z Runner name: 'i-09c05d8e2aea2c844'
2025-05-07T20:22:35.2908654Z Machine name: 'ip-10-0-65-139'
2025-05-07T20:22:35.2911533Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:22:35.2913935Z Contents: read
2025-05-07T20:22:35.2914458Z Metadata: read
2025-05-07T20:22:35.2914974Z Packages: read
2025-05-07T20:22:35.2915478Z ##[endgroup]
2025-05-07T20:22:35.2917704Z Secret source: None
2025-05-07T20:22:35.2918799Z Prepare workflow directory
2025-05-07T20:22:35.3443409Z Prepare all required actions
2025-05-07T20:22:35.3480153Z Getting action download info
2025-05-07T20:22:35.5695775Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:22:35.7981248Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:22:36.0540288Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:22:37.6063262Z Getting action download info
2025-05-07T20:22:37.6974440Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:22:37.9570064Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.10, 12.8.0, 12.6.3, clang)
2025-05-07T20:22:38.0198720Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:22:38.0335890Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:22:38.0349018Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:22:38.0350615Z ##[endgroup]
2025-05-07T20:22:39.2774758Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:22:39.2775200Z Instance Type: g5.4xlarge
2025-05-07T20:22:39.2775464Z AMI Name: unknown
2025-05-07T20:22:39.2816319Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:22:44.7176826Z ##[group]Run actions/checkout@v4
2025-05-07T20:22:44.7177157Z with:
2025-05-07T20:22:44.7177423Z submodules: true
2025-05-07T20:22:44.7177672Z repository: pytorch/FBGEMM
2025-05-07T20:22:44.7178078Z token: ***
2025-05-07T20:22:44.7178296Z ssh-strict: true
2025-05-07T20:22:44.7178522Z ssh-user: git
2025-05-07T20:22:44.7178753Z persist-credentials: true
2025-05-07T20:22:44.7179020Z clean: true
2025-05-07T20:22:44.7179259Z sparse-checkout-cone-mode: true
2025-05-07T20:22:44.7179539Z fetch-depth: 1
2025-05-07T20:22:44.7179765Z fetch-tags: false
2025-05-07T20:22:44.7179994Z show-progress: true
2025-05-07T20:22:44.7180226Z lfs: false
2025-05-07T20:22:44.7180444Z set-safe-directory: true
2025-05-07T20:22:44.7180704Z env:
2025-05-07T20:22:44.7180923Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:44.7181244Z BUILD_ENV: build_binary
2025-05-07T20:22:44.7181511Z BUILD_TARGET: genai
2025-05-07T20:22:44.7181750Z BUILD_VARIANT: cuda
2025-05-07T20:22:44.7182025Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:44.7182290Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:44.7182545Z ##[endgroup]
2025-05-07T20:22:44.8359806Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:22:44.8361099Z ##[group]Getting Git version info
2025-05-07T20:22:44.8361563Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.8362198Z [command]/usr/bin/git version
2025-05-07T20:22:44.8362478Z git version 2.47.1
2025-05-07T20:22:44.8366966Z ##[endgroup]
2025-05-07T20:22:44.8389458Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/c98a28e5-12a9-4c25-a94b-b5f916230478' before making global git config changes
2025-05-07T20:22:44.8390377Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:22:44.8394941Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.8432026Z Deleting the contents of '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.8435336Z ##[group]Initializing the repository
2025-05-07T20:22:44.8439450Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.8480383Z hint: Using 'master' as the name for the initial branch. This default branch name
2025-05-07T20:22:44.8481002Z hint: is subject to change. To configure the initial branch name to use in all
2025-05-07T20:22:44.8481558Z hint: of your new repositories, which will suppress this warning, call:
2025-05-07T20:22:44.8481962Z hint:
2025-05-07T20:22:44.8482275Z hint: git config --global init.defaultBranch <name>
2025-05-07T20:22:44.8482626Z hint:
2025-05-07T20:22:44.8482965Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
2025-05-07T20:22:44.8483527Z hint: 'development'. The just-created branch can be renamed via this command:
2025-05-07T20:22:44.8483967Z hint:
2025-05-07T20:22:44.8484195Z hint: git branch -m <name>
2025-05-07T20:22:44.8484717Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/
2025-05-07T20:22:44.8492220Z [command]/usr/bin/git remote add origin https://github.com/pytorch/FBGEMM
2025-05-07T20:22:44.8528044Z ##[endgroup]
2025-05-07T20:22:44.8528509Z ##[group]Disabling automatic garbage collection
2025-05-07T20:22:44.8532196Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:22:44.8565193Z ##[endgroup]
2025-05-07T20:22:44.8565617Z ##[group]Setting up auth
2025-05-07T20:22:44.8571332Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:22:44.8602815Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:22:44.8978599Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:22:44.9010963Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:22:44.9359837Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:44.9410319Z ##[endgroup]
2025-05-07T20:22:44.9410786Z ##[group]Fetching the repository
2025-05-07T20:22:44.9418295Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:22:45.3961534Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:22:45.3962068Z * [new ref] a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:22:45.3998488Z ##[endgroup]
2025-05-07T20:22:45.3998901Z ##[group]Determining the checkout info
2025-05-07T20:22:45.4001628Z ##[endgroup]
2025-05-07T20:22:45.4007292Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:22:45.4046064Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:22:45.4087792Z ##[group]Checking out the ref
2025-05-07T20:22:45.4091638Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:22:45.5172855Z Note: switching to 'refs/remotes/pull/4066/merge'.
2025-05-07T20:22:45.5173200Z
2025-05-07T20:22:45.5173494Z You are in 'detached HEAD' state. You can look around, make experimental
2025-05-07T20:22:45.5174031Z changes and commit them, and you can discard any commits you make in this
2025-05-07T20:22:45.5174564Z state without impacting any branches by switching back to a branch.
2025-05-07T20:22:45.5174884Z
2025-05-07T20:22:45.5175107Z If you want to create a new branch to retain commits you create, you may
2025-05-07T20:22:45.5175593Z do so (now or later) by using -c with the switch command. Example:
2025-05-07T20:22:45.5175871Z
2025-05-07T20:22:45.5175992Z   git switch -c <new-branch-name>
2025-05-07T20:22:45.5176197Z
2025-05-07T20:22:45.5176361Z Or undo this operation with:
2025-05-07T20:22:45.5176546Z
2025-05-07T20:22:45.5176648Z   git switch -
2025-05-07T20:22:45.5177045Z
2025-05-07T20:22:45.5177284Z Turn off this advice by setting config variable advice.detachedHead to false
2025-05-07T20:22:45.5177640Z
2025-05-07T20:22:45.5178033Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:22:45.5186540Z ##[endgroup]
2025-05-07T20:22:45.5186979Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:22:45.5192253Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:45.5241628Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:22:45.5273762Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:22:45.5307062Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:22:45.5335356Z ##[endgroup]
2025-05-07T20:22:45.5335767Z ##[group]Fetching submodules
2025-05-07T20:22:45.5338860Z [command]/usr/bin/git submodule sync
2025-05-07T20:22:45.5686511Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:22:45.6021218Z Submodule 'external/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'external/asmjit'
2025-05-07T20:22:45.6024775Z Submodule 'external/composable_kernel' (https://github.com/jwfromm/composable_kernel.git) registered for path 'external/composable_kernel'
2025-05-07T20:22:45.6028149Z Submodule 'external/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'external/cpuinfo'
2025-05-07T20:22:45.6032710Z Submodule 'external/cutlass' (https://github.com/jwfromm/cutlass) registered for path 'external/cutlass'
2025-05-07T20:22:45.6036639Z Submodule 'external/googletest' (https://github.com/google/googletest) registered for path 'external/googletest'
2025-05-07T20:22:45.6040872Z Submodule 'external/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 'external/hipify_torch'
2025-05-07T20:22:45.6044339Z Submodule 'external/json' (https://github.com/nlohmann/json.git) registered for path 'external/json'
2025-05-07T20:22:45.6077121Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/asmjit'...
2025-05-07T20:22:45.9180885Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/composable_kernel'...
2025-05-07T20:22:46.4841145Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cpuinfo'...
2025-05-07T20:22:47.0385484Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cutlass'...
2025-05-07T20:22:48.1051024Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/googletest'...
2025-05-07T20:22:48.4545988Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/hipify_torch'...
2025-05-07T20:22:48.7984282Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/json'...
2025-05-07T20:22:49.9716653Z From https://github.com/asmjit/asmjit
2025-05-07T20:22:49.9717242Z * branch e5d7c0bd5d9aec44d68830187138149e6a8c4e32 -> FETCH_HEAD
2025-05-07T20:22:50.0198784Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:22:50.7287006Z From https://github.com/jwfromm/composable_kernel
2025-05-07T20:22:50.7287489Z * branch 4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 -> FETCH_HEAD
2025-05-07T20:22:51.0101912Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:22:51.6752366Z From https://github.com/pytorch/cpuinfo
2025-05-07T20:22:51.6753972Z * branch 6543fec09b2f04ac4a666882998b534afc9c1349 -> FETCH_HEAD
2025-05-07T20:22:51.7847925Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:22:52.9290707Z From https://github.com/jwfromm/cutlass
2025-05-07T20:22:52.9291339Z * branch 3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 -> FETCH_HEAD
2025-05-07T20:22:53.6293158Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:22:54.4226596Z From https://github.com/google/googletest
2025-05-07T20:22:54.4227078Z * branch f8d7d77c06936315286eb55f8de22cd23c188571 -> FETCH_HEAD
2025-05-07T20:22:54.4637058Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:22:55.1519707Z From https://github.com/ROCmSoftwarePlatform/hipify_torch
2025-05-07T20:22:55.1520228Z * branch 420084499c7c1e1c2d801922f40df202eac5f3a0 -> FETCH_HEAD
2025-05-07T20:22:55.1606190Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:22:55.9522517Z From https://github.com/nlohmann/json
2025-05-07T20:22:55.9523452Z * branch 9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 -> FETCH_HEAD
2025-05-07T20:22:56.0673803Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:22:56.0692594Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:22:56.1035562Z Entering 'external/asmjit'
2025-05-07T20:22:56.1068790Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.1100417Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.1132586Z Entering 'external/cutlass'
2025-05-07T20:22:56.1164078Z Entering 'external/googletest'
2025-05-07T20:22:56.1196231Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.1228284Z Entering 'external/json'
2025-05-07T20:22:56.1275206Z ##[endgroup]
2025-05-07T20:22:56.1275646Z ##[group]Persisting credentials for submodules
2025-05-07T20:22:56.1282187Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:22:56.1620297Z Entering 'external/asmjit'
2025-05-07T20:22:56.1687724Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.1760053Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.1826220Z Entering 'external/cutlass'
2025-05-07T20:22:56.1900560Z Entering 'external/googletest'
2025-05-07T20:22:56.1967759Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.2039895Z Entering 'external/json'
2025-05-07T20:22:56.2124532Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:22:56.2463720Z Entering 'external/asmjit'
2025-05-07T20:22:56.2528199Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:22:56.2531397Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.2594384Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:22:56.2597060Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.2658900Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:22:56.2661699Z Entering 'external/cutlass'
2025-05-07T20:22:56.2723037Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:22:56.2726547Z Entering 'external/googletest'
2025-05-07T20:22:56.2790404Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:22:56.2793792Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.2856634Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:22:56.2859099Z Entering 'external/json'
2025-05-07T20:22:56.2920206Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:22:56.3007596Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:22:56.3348414Z Entering 'external/asmjit'
2025-05-07T20:22:56.3381318Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.3415408Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.3447719Z Entering 'external/cutlass'
2025-05-07T20:22:56.3480728Z Entering 'external/googletest'
2025-05-07T20:22:56.3513448Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.3546437Z Entering 'external/json'
2025-05-07T20:22:56.3595168Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:22:56.3934290Z Entering 'external/asmjit'
2025-05-07T20:22:56.3967884Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.4000424Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.4033163Z Entering 'external/cutlass'
2025-05-07T20:22:56.4065665Z Entering 'external/googletest'
2025-05-07T20:22:56.4098664Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.4132022Z Entering 'external/json'
2025-05-07T20:22:56.4201411Z ##[endgroup]
2025-05-07T20:22:56.4221739Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:22:56.4251278Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:22:56.4436666Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:22:56.4437002Z with:
2025-05-07T20:22:56.4437262Z name: fbgemm_genai_x86_clang_py3.10_cu12.8.0.whl
2025-05-07T20:22:56.4437614Z merge-multiple: false
2025-05-07T20:22:56.4437885Z repository: pytorch/FBGEMM
2025-05-07T20:22:56.4438174Z run-id: 14891846252
2025-05-07T20:22:56.4438399Z env:
2025-05-07T20:22:56.4438645Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.4438972Z BUILD_ENV: build_binary
2025-05-07T20:22:56.4439229Z BUILD_TARGET: genai
2025-05-07T20:22:56.4439470Z BUILD_VARIANT: cuda
2025-05-07T20:22:56.4439724Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:56.4439992Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.4440250Z ##[endgroup]
2025-05-07T20:22:56.6808931Z Downloading single artifact
2025-05-07T20:22:56.7712018Z Preparing to download the following artifacts:
2025-05-07T20:22:56.7713067Z - fbgemm_genai_x86_clang_py3.10_cu12.8.0.whl (ID: 3081404175, Size: 18501011, Expected Digest: sha256:11df06046b7d4c3f3f186959566dfdd554d7e11b3fd21f4c28aab1ad73234076)
2025-05-07T20:22:56.8196918Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-73f140a1-be13-5b4d-b1a3-0329f4aec114/artifacts/7fea0a3d48a0a904e7ca275a23bd63820365acfdb69b50cc760cc4ba3d0dc013.zip
2025-05-07T20:22:56.8198485Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:56.8843863Z (node:57009) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:22:56.8844956Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:22:57.1433009Z SHA256 digest of downloaded artifact is 11df06046b7d4c3f3f186959566dfdd554d7e11b3fd21f4c28aab1ad73234076
2025-05-07T20:22:57.1433762Z Artifact download completed successfully.
2025-05-07T20:22:57.1434114Z Total of 1 artifact(s) downloaded
2025-05-07T20:22:57.1440009Z Download artifact has finished successfully
2025-05-07T20:22:57.1709254Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:22:57.1709672Z with:
2025-05-07T20:22:57.1709904Z driver-version: 570.133.07
2025-05-07T20:22:57.1710175Z env:
2025-05-07T20:22:57.1710413Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:57.1710730Z BUILD_ENV: build_binary
2025-05-07T20:22:57.1710991Z BUILD_TARGET: genai
2025-05-07T20:22:57.1711242Z BUILD_VARIANT: cuda
2025-05-07T20:22:57.1711490Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:57.1711763Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:57.1712020Z ##[endgroup]
2025-05-07T20:22:57.1803035Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:22:57.1803436Z with:
2025-05-07T20:22:57.1803893Z timeout_minutes: 10
2025-05-07T20:22:57.1804149Z max_attempts: 3
2025-05-07T20:22:57.1828510Z command: # Is it disgusting to have a full shell script here in this github action? Sure
# But is it the best way to make it so that this action relies on nothing else? Absolutely
set -eou pipefail

DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

install_nvidia_docker2_amzn2() {
  (
    set -x
    # Needed for yum-config-manager
    sudo yum install -y yum-utils
    if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
      YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
    else
      # Amazon Linux 2
      YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
    fi
    sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
    sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
    sudo systemctl restart docker
  )
}

install_nvidia_docker2_ubuntu20() {
  (
    set -x
    # Install nvidia-driver package if not installed
    status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
    if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
      sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
      sudo systemctl restart docker
    fi
  )
}

pre_install_nvidia_driver_amzn2() {
  (
    # Purge any nvidia driver installed from RHEL repo
    sudo yum remove -y nvidia-driver-latest-dkms
  )
}

install_nvidia_driver_common() {
  (
    # Try to gather more information about the runner and its existing NVIDIA driver if any
    echo "Before installing NVIDIA driver"
    lspci
    lsmod
    modinfo nvidia || true

    HAS_NVIDIA_DRIVER=0
    # Check if NVIDIA driver has already been installed
    if [ -x "$(command -v nvidia-smi)" ]; then
      set +e
      # The driver exists, check its version next. Also check only the first GPU if there are more than one of them
      # so that the same driver version is not print over multiple lines
      INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
      NVIDIA_SMI_STATUS=$?
      if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
        echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
      elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
        echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
        # Turn off persistent mode so that the installation script can unload the kernel module
        sudo killall nvidia-persistenced || true
      else
        HAS_NVIDIA_DRIVER=1
        echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
      fi
      set -e
    fi

    if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
      # CAUTION: this may need to be updated in future
      if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
        sudo yum groupinstall -y "Development Tools"
        # ensure our kernel install is the same as our underlying kernel,
        # groupinstall "Development Tools" has a habit of mismatching kernel headers
        sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
        sudo modprobe backlight
      fi
      sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

      set +e
      sudo /bin/bash /tmp/nvidia_driver -s --no-drm
      NVIDIA_INSTALLATION_STATUS=$?

      RESET_GPU=0
      if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
        sudo cat /var/log/nvidia-installer.log
        # Fail to install NVIDIA driver, try to reset the GPU
        RESET_GPU=1
      elif [ -x "$(command -v nvidia-smi)" ]; then
        # Check again if nvidia-smi works even if the driver installation completes successfully
        INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
        NVIDIA_SMI_STATUS=$?
        if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
          RESET_GPU=1
        fi
      fi

      if [ "$RESET_GPU" -eq 1 ]; then
        NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
        # The GPU can get stuck in a failure state if somehow the test crashs the GPU microcode. When this
        # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
        for PCI_ID in $NVIDIA_DEVICES; do
          DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
          echo "Reseting $PCI_ID (enabled state: $DEVICE_ENABLED)"
          # This requires sudo permission of course
          echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
          sleep 1
        done
      fi

      sudo rm -fv /tmp/nvidia_driver
      set -e
    fi
  )
}

post_install_nvidia_driver_common() {
  (
    sudo modprobe nvidia || true
    echo "After installing NVIDIA driver"
    lspci
    lsmod
    modinfo nvidia || true
    (
      set +e
      nvidia-smi
      # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in
      # the case where the driver has already crashed as it still can get the driver version
      # and some basic information like the bus ID. However, the rest of the information
      # would be missing (ERR!), for example:
      #
      # +-----------------------------------------------------------------------------+
      # | NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.0 |
      # |-------------------------------+----------------------+----------------------+
      # | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
      # | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
      # | | | MIG M. |
      # |===============================+======================+======================|
      # | 0 ERR! Off | 00000000:00:1E.0 Off | ERR! |
      # |ERR! ERR! ERR! ERR! / ERR! | 4184MiB / 23028MiB | ERR! Default |
      # | | | ERR! |
      # +-------------------------------+----------------------+----------------------+
      #
      # +-----------------------------------------------------------------------------+
      # | Processes: |
      # | GPU GI CI PID Type Process name GPU Memory |
      # | ID ID Usage |
      # |=============================================================================|
      # +-----------------------------------------------------------------------------+
      #
      # This should be reported as a failure instead as it will guarantee to fail when
      # Docker tries to run with --gpus all
      #
      # So, the correct check here is to query one of the missing piece of info like
      # GPU name, so that the command can fail accordingly
      nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
      NVIDIA_SMI_STATUS=$?
      # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
      if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
        echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
      else
        echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
        exit ${NVIDIA_SMI_STATUS}
      fi
      set -e
    )
  )
}

install_nvidia_driver_amzn2() {
  (
    set -x
    pre_install_nvidia_driver_amzn2
    install_nvidia_driver_common
    post_install_nvidia_driver_common
  )
}

install_nvidia_driver_ubuntu20() {
  (
    set -x
    install_nvidia_driver_common
    post_install_nvidia_driver_common
  )
}

echo "== Installing nvidia driver ${DRIVER_FN} =="
case "${DISTRIBUTION}" in
  amzn*)
    install_nvidia_driver_amzn2
    ;;
  ubuntu20.04)
    install_nvidia_driver_ubuntu20
    ;;
  *)
    echo "ERROR: Unknown distribution ${DISTRIBUTION}"
    exit 1
    ;;
esac

# Install container toolkit based on distribution
echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
case "${DISTRIBUTION}" in
  amzn*)
    install_nvidia_docker2_amzn2
    ;;
  ubuntu20.04)
    install_nvidia_docker2_ubuntu20
    ;;
  *)
    echo "ERROR: Unknown distribution ${DISTRIBUTION}"
    exit 1
    ;;
esac

echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

# Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
# more than one GPUs. This just needs to be run once. The command fails
# on subsequent runs and complains that the mode is already on, but that's
# ok
sudo nvidia-persistenced || true
# This should show persistence mode ON
nvidia-smi
2025-05-07T20:22:57.1852624Z retry_wait_seconds: 10
2025-05-07T20:22:57.1852904Z polling_interval_seconds: 1
2025-05-07T20:22:57.1853183Z warning_on_retry: true
2025-05-07T20:22:57.1853448Z continue_on_error: false
2025-05-07T20:22:57.1853704Z env:
2025-05-07T20:22:57.1853943Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:57.1854262Z BUILD_ENV: build_binary
2025-05-07T20:22:57.1854528Z BUILD_TARGET: genai
2025-05-07T20:22:57.1854771Z BUILD_VARIANT: cuda
2025-05-07T20:22:57.1855032Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:22:57.1855307Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:57.1855565Z DRIVER_VERSION: 570.133.07
2025-05-07T20:22:57.1855828Z ##[endgroup]
2025-05-07T20:22:57.2653179Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:22:57.2653809Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:22:57.2658099Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:22:57.8167653Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:22:57.8181118Z No packages marked for removal.
2025-05-07T20:22:57.8231615Z Dependencies resolved.
2025-05-07T20:22:57.8241042Z Nothing to do.
2025-05-07T20:22:57.8241423Z Complete!
2025-05-07T20:22:57.8547901Z + install_nvidia_driver_common
2025-05-07T20:22:57.8552770Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:22:57.8553083Z + lspci
2025-05-07T20:22:57.8555013Z Before installing NVIDIA driver
2025-05-07T20:22:57.8740811Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:57.8742047Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:57.8742998Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:57.8743882Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:57.8744663Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:57.8745574Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:57.8746363Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:57.8747170Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:57.8747845Z + lsmod
2025-05-07T20:22:57.8783848Z Module Size Used by
2025-05-07T20:22:57.8784205Z xt_conntrack 16384 1
2025-05-07T20:22:57.8784597Z nft_chain_nat 16384 3
2025-05-07T20:22:57.8785035Z xt_MASQUERADE 20480 1
2025-05-07T20:22:57.8785547Z nf_nat 57344 2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:57.8786058Z nf_conntrack_netlink 57344 0
2025-05-07T20:22:57.8786485Z nf_conntrack 184320 4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:57.8786947Z nf_defrag_ipv6 24576 1 nf_conntrack
2025-05-07T20:22:57.8787485Z nf_defrag_ipv4 16384 1 nf_conntrack
2025-05-07T20:22:57.8787787Z xfrm_user 57344 1
2025-05-07T20:22:57.8788077Z xfrm_algo 16384 1 xfrm_user
2025-05-07T20:22:57.8788384Z xt_addrtype 16384 2
2025-05-07T20:22:57.8788652Z nft_compat 20480 4
2025-05-07T20:22:57.8788975Z nf_tables 311296 57 nft_compat,nft_chain_nat
2025-05-07T20:22:57.8789418Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:57.8789815Z br_netfilter 36864 0
2025-05-07T20:22:57.8790111Z bridge 323584 1 br_netfilter
2025-05-07T20:22:57.8790434Z stp 16384 1 bridge
2025-05-07T20:22:57.8790731Z llc 16384 2 bridge,stp
2025-05-07T20:22:57.8791024Z overlay 167936 0
2025-05-07T20:22:57.8791287Z tls 135168 0
2025-05-07T20:22:57.8791552Z nls_ascii 16384 1
2025-05-07T20:22:57.8791812Z nls_cp437 20480 1
2025-05-07T20:22:57.8792078Z vfat 24576 1
2025-05-07T20:22:57.8792356Z fat 86016 1 vfat
2025-05-07T20:22:57.8792632Z sunrpc 696320 1
2025-05-07T20:22:57.8792892Z ena 180224 0
2025-05-07T20:22:57.8793149Z i8042 45056 0
2025-05-07T20:22:57.8793414Z serio 28672 3 i8042
2025-05-07T20:22:57.8793808Z ghash_clmulni_intel 16384 0
2025-05-07T20:22:57.8794095Z button 24576 0
2025-05-07T20:22:57.8794362Z sch_fq_codel 20480 17
2025-05-07T20:22:57.8794628Z dm_mod 188416 0
2025-05-07T20:22:57.8794893Z fuse 163840 1
2025-05-07T20:22:57.8795160Z loop 36864 0
2025-05-07T20:22:57.8795423Z configfs 57344 1
2025-05-07T20:22:57.8795694Z dax 45056 1 dm_mod
2025-05-07T20:22:57.8795984Z dmi_sysfs 20480 0
2025-05-07T20:22:57.8796246Z crc32_pclmul 16384 0
2025-05-07T20:22:57.8796512Z crc32c_intel 24576 0
2025-05-07T20:22:57.8796781Z efivarfs 24576 1
2025-05-07T20:22:57.8797035Z + modinfo nvidia
2025-05-07T20:22:57.8802586Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:57.8803090Z import_ns: DMA_BUF
2025-05-07T20:22:57.8803355Z alias: char-major-195-*
2025-05-07T20:22:57.8803635Z version: 570.133.07
2025-05-07T20:22:57.8803904Z supported: external
2025-05-07T20:22:57.8804255Z license: Dual MIT/GPL
2025-05-07T20:22:57.8804593Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:57.8804945Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:57.8805630Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:22:57.8806068Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:57.8806427Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:57.8806768Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:57.8807093Z depends: i2c-core,drm
2025-05-07T20:22:57.8807361Z retpoline: Y
2025-05-07T20:22:57.8807585Z name: nvidia
2025-05-07T20:22:57.8807961Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:57.8808452Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:57.8808906Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:57.8809446Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:22:57.8809772Z parm: NVreg_RmLogonRC:int
2025-05-07T20:22:57.8810095Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:57.8810435Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:22:57.8810754Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:22:57.8811071Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:22:57.8811442Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:57.8811843Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:22:57.8812196Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:22:57.8812503Z parm: NVreg_EnableMSI:int
2025-05-07T20:22:57.8812822Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:57.8813197Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:57.8813602Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:57.8813994Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:57.8814433Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.8814856Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:22:57.8815289Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:57.8815720Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:22:57.8816079Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:57.8816461Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:57.8816848Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:57.8817207Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:22:57.8817540Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:57.8817889Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:57.8818229Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:57.8818555Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:22:57.8818914Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:57.8819297Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:22:57.8819642Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:22:57.8819991Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:57.8820356Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:57.8820710Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:22:57.8821061Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:57.8821411Z parm: NVreg_RmMsg:charp
2025-05-07T20:22:57.8821718Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:22:57.8822078Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:22:57.8822414Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:22:57.8822745Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:57.8823092Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:57.8823462Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:57.8824100Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:22:57.8824454Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:22:57.8824809Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:57.8825163Z parm: rm_firmware_active:charp
2025-05-07T20:22:57.8825618Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:22:57.8825876Z ++ command -v nvidia-smi
2025-05-07T20:22:57.8826145Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:22:57.8826412Z + set +e
2025-05-07T20:22:57.8826737Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:22:59.6862805Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:22:59.6863497Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:22:59.6863996Z + '[' 0 -ne 0 ']'
2025-05-07T20:22:59.6864446Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:22:59.6865007Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:22:59.6865905Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:22:59.6866865Z + set -e
2025-05-07T20:22:59.6867802Z + '[' 1 -eq 0 ']'
2025-05-07T20:22:59.6868616Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:22:59.6869584Z + post_install_nvidia_driver_common
2025-05-07T20:22:59.6870107Z + sudo modprobe nvidia
2025-05-07T20:22:59.8125095Z + echo 'After installing NVIDIA driver'
2025-05-07T20:22:59.8125420Z + lspci
2025-05-07T20:22:59.8125777Z After installing NVIDIA driver
2025-05-07T20:22:59.8244449Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:59.8244978Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:59.8245556Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:59.8246093Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:59.8246610Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:59.8247160Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:59.8247683Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:59.8248175Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:59.8248604Z + lsmod
2025-05-07T20:22:59.8278347Z Module Size Used by
2025-05-07T20:22:59.8278682Z nvidia_uvm 1884160 0
2025-05-07T20:22:59.8278965Z nvidia 11583488 1 nvidia_uvm
2025-05-07T20:22:59.8279275Z drm 602112 1 nvidia
2025-05-07T20:22:59.8279599Z drm_panel_orientation_quirks 32768 1 drm
2025-05-07T20:22:59.8279939Z backlight 24576 1 drm
2025-05-07T20:22:59.8280257Z i2c_core 110592 2 nvidia,drm
2025-05-07T20:22:59.8280567Z xt_conntrack 16384 1
2025-05-07T20:22:59.8280834Z nft_chain_nat 16384 3
2025-05-07T20:22:59.8281108Z xt_MASQUERADE 20480 1
2025-05-07T20:22:59.8281422Z nf_nat 57344 2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:59.8281770Z nf_conntrack_netlink 57344 0
2025-05-07T20:22:59.8282185Z nf_conntrack 184320 4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:59.8282636Z nf_defrag_ipv6 24576 1 nf_conntrack
2025-05-07T20:22:59.8282968Z nf_defrag_ipv4 16384 1 nf_conntrack
2025-05-07T20:22:59.8283274Z xfrm_user 57344 1
2025-05-07T20:22:59.8283558Z xfrm_algo 16384 1 xfrm_user
2025-05-07T20:22:59.8283854Z xt_addrtype 16384 2
2025-05-07T20:22:59.8284126Z nft_compat 20480 4
2025-05-07T20:22:59.8284453Z nf_tables 311296 57 nft_compat,nft_chain_nat
2025-05-07T20:22:59.8284883Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:59.8285269Z br_netfilter 36864 0
2025-05-07T20:22:59.8285562Z bridge 323584 1 br_netfilter
2025-05-07T20:22:59.8285873Z stp 16384 1 bridge
2025-05-07T20:22:59.8286168Z llc 16384 2 bridge,stp
2025-05-07T20:22:59.8286474Z overlay 167936 0
2025-05-07T20:22:59.8286740Z tls 135168 0
2025-05-07T20:22:59.8286997Z nls_ascii 16384 1
2025-05-07T20:22:59.8288997Z nls_cp437 20480 1
2025-05-07T20:22:59.8289276Z vfat 24576 1
2025-05-07T20:22:59.8289535Z fat 86016 1 vfat
2025-05-07T20:22:59.8289821Z sunrpc 696320 1
2025-05-07T20:22:59.8290091Z ena 180224 0
2025-05-07T20:22:59.8290351Z i8042 45056 0
2025-05-07T20:22:59.8290614Z serio 28672 3 i8042
2025-05-07T20:22:59.8290906Z ghash_clmulni_intel 16384 0
2025-05-07T20:22:59.8291185Z button 24576 0
2025-05-07T20:22:59.8291448Z sch_fq_codel 20480 17
2025-05-07T20:22:59.8291718Z dm_mod 188416 0
2025-05-07T20:22:59.8291979Z fuse 163840 1
2025-05-07T20:22:59.8292236Z loop 36864 0
2025-05-07T20:22:59.8292653Z configfs 57344 1
2025-05-07T20:22:59.8292923Z dax 45056 1 dm_mod
2025-05-07T20:22:59.8293202Z dmi_sysfs 20480 0
2025-05-07T20:22:59.8293465Z crc32_pclmul 16384 0
2025-05-07T20:22:59.8293739Z crc32c_intel 24576 0
2025-05-07T20:22:59.8293998Z efivarfs 24576 1
2025-05-07T20:22:59.8294257Z + modinfo nvidia
2025-05-07T20:22:59.8295088Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:59.8295785Z import_ns: DMA_BUF
2025-05-07T20:22:59.8296148Z alias: char-major-195-*
2025-05-07T20:22:59.8296443Z version: 570.133.07
2025-05-07T20:22:59.8296703Z supported: external
2025-05-07T20:22:59.8296963Z license: Dual MIT/GPL
2025-05-07T20:22:59.8297270Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:59.8297629Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:59.8297963Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:22:59.8298307Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:59.8298662Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:59.8299012Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:59.8299335Z depends: i2c-core,drm
2025-05-07T20:22:59.8299603Z retpoline: Y
2025-05-07T20:22:59.8299834Z name: nvidia
2025-05-07T20:22:59.8300207Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:59.8300702Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:59.8301163Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:59.8301599Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:22:59.8301917Z parm: NVreg_RmLogonRC:int
2025-05-07T20:22:59.8302236Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:59.8302568Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:22:59.8302878Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:22:59.8303202Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:22:59.8303587Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:59.8303990Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:22:59.8304342Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:22:59.8304661Z parm: NVreg_EnableMSI:int
2025-05-07T20:22:59.8304975Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:59.8305358Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:59.8305776Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:59.8306172Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:59.8306599Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.8307025Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:22:59.8307465Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:59.8307892Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:22:59.8308244Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:59.8308629Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:59.8309162Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:59.8309518Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:22:59.8309854Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:59.8310200Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:59.8310531Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:59.8310855Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:22:59.8311217Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:59.8311593Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:22:59.8311938Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:22:59.8312296Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:59.8312650Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:59.8313092Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:22:59.8313450Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:59.8313934Z parm: NVreg_RmMsg:charp
2025-05-07T20:22:59.8314241Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:22:59.8314591Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:22:59.8314931Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:22:59.8315254Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:59.8315599Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:59.8315974Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:59.8316334Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:22:59.8316679Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:22:59.8317043Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:59.8317403Z parm: rm_firmware_active:charp
2025-05-07T20:22:59.8317697Z + set +e
2025-05-07T20:22:59.8317907Z + nvidia-smi
2025-05-07T20:23:01.2265526Z Wed May 7 20:23:01 2025
2025-05-07T20:23:01.2265950Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.2266517Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
2025-05-07T20:23:01.2267021Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:01.2267527Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:01.2268073Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
2025-05-07T20:23:01.2268522Z | | | MIG M. |
2025-05-07T20:23:01.2268872Z |=========================================+========================+======================|
2025-05-07T20:23:01.2329704Z | 0 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 |
2025-05-07T20:23:01.2330194Z | 0% 29C P0 62W / 300W | 0MiB / 23028MiB | 4% Default |
2025-05-07T20:23:01.2330603Z | | | N/A |
2025-05-07T20:23:01.2331013Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:01.2331416Z
2025-05-07T20:23:01.2331829Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.2332277Z | Processes: |
2025-05-07T20:23:01.2332734Z | GPU GI CI PID Type Process name GPU Memory |
2025-05-07T20:23:01.2333158Z | ID ID Usage |
2025-05-07T20:23:01.2333525Z |=========================================================================================|
2025-05-07T20:23:01.2334547Z | No running processes found |
2025-05-07T20:23:01.2335417Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.6457087Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:03.0325966Z NVIDIA A10G
2025-05-07T20:23:03.2951463Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:03.2951743Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:03.2951998Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:03.2952304Z + set -e
2025-05-07T20:23:03.2952530Z INFO: Ignoring allowed status 0
2025-05-07T20:23:03.2961133Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:03.2964246Z + sudo yum install -y yum-utils
2025-05-07T20:23:03.7216289Z Last metadata expiration check: 0:05:54 ago on Wed May 7 20:17:09 2025.
2025-05-07T20:23:03.7467439Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:03.7868794Z Dependencies resolved.
2025-05-07T20:23:03.8051548Z Nothing to do.
2025-05-07T20:23:03.8051799Z Complete!
2025-05-07T20:23:03.8437623Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:03.8438263Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:03.8439143Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:04.1340352Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:04.1913875Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:04.7483623Z nvidia-container-toolkit 14 kB/s | 833 B 00:00
2025-05-07T20:23:04.7736681Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:04.8140991Z Dependencies resolved.
2025-05-07T20:23:04.8320328Z ================================================================================
2025-05-07T20:23:04.8320948Z Package Arch Version Repository Size
2025-05-07T20:23:04.8321507Z ================================================================================
2025-05-07T20:23:04.8321858Z Downgrading:
2025-05-07T20:23:04.8322231Z nvidia-container-toolkit x86_64 1.16.2-1 nvidia-container-toolkit 1.2 M
2025-05-07T20:23:04.8322837Z nvidia-container-toolkit-base x86_64 1.16.2-1 nvidia-container-toolkit 5.6 M
2025-05-07T20:23:04.8323279Z
2025-05-07T20:23:04.8323426Z Transaction Summary
2025-05-07T20:23:04.8323975Z ================================================================================
2025-05-07T20:23:04.8324436Z Downgrade 2 Packages
2025-05-07T20:23:04.8324617Z
2025-05-07T20:23:04.8324761Z Total download size: 6.8 M
2025-05-07T20:23:04.8325145Z Downloading Packages:
2025-05-07T20:23:04.8741766Z (1/2): nvidia-container-toolkit-1.16.2-1.x86_64 31 MB/s | 1.2 MB 00:00
2025-05-07T20:23:04.9175838Z (2/2): nvidia-container-toolkit-base-1.16.2-1.x 67 MB/s | 5.6 MB 00:00
2025-05-07T20:23:04.9184854Z --------------------------------------------------------------------------------
2025-05-07T20:23:04.9188120Z Total 80 MB/s | 6.8 MB 00:00
2025-05-07T20:23:04.9190502Z Running transaction check
2025-05-07T20:23:04.9297204Z Transaction check succeeded.
2025-05-07T20:23:04.9297640Z Running transaction test
2025-05-07T20:23:04.9593273Z Transaction test succeeded.
2025-05-07T20:23:04.9595394Z Running transaction
2025-05-07T20:23:05.5130644Z Preparing : 1/1
2025-05-07T20:23:05.6195516Z Downgrading : nvidia-container-toolkit-base-1.16.2-1.x86_64 1/4
2025-05-07T20:23:05.6217229Z Downgrading : nvidia-container-toolkit-1.16.2-1.x86_64 2/4
2025-05-07T20:23:05.6427866Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 2/4
2025-05-07T20:23:05.6428489Z Cleanup : nvidia-container-toolkit-1.17.6-1.x86_64 3/4
2025-05-07T20:23:05.6538036Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 3/4
2025-05-07T20:23:05.6560105Z Cleanup : nvidia-container-toolkit-base-1.17.6-1.x86_64 4/4
2025-05-07T20:23:07.1212316Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 4/4
2025-05-07T20:23:07.1212947Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 1/4
2025-05-07T20:23:07.1213510Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 2/4
2025-05-07T20:23:07.1214066Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 3/4
2025-05-07T20:23:07.2584752Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 4/4================================================================================
2025-05-07T20:23:07.2585693Z WARNING:
2025-05-07T20:23:07.2585956Z A newer release of "Amazon Linux" is available.
2025-05-07T20:23:07.2586197Z
2025-05-07T20:23:07.2586301Z Available Versions:
2025-05-07T20:23:07.2586456Z
2025-05-07T20:23:07.2586563Z Version 2023.7.20250331:
2025-05-07T20:23:07.2586895Z Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:07.2587163Z
2025-05-07T20:23:07.2587292Z dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:07.2587512Z
2025-05-07T20:23:07.2587609Z Release notes:
2025-05-07T20:23:07.2588053Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:07.2588440Z
2025-05-07T20:23:07.2588536Z Version 2023.7.20250414:
2025-05-07T20:23:07.2588867Z Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:07.2589127Z
2025-05-07T20:23:07.2589254Z dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:07.2589473Z
2025-05-07T20:23:07.2589570Z Release notes:
2025-05-07T20:23:07.2589981Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:07.2590366Z
2025-05-07T20:23:07.2590460Z Version 2023.7.20250428:
2025-05-07T20:23:07.2590792Z Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:07.2591052Z
2025-05-07T20:23:07.2591173Z dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:07.2591402Z
2025-05-07T20:23:07.2591492Z Release notes:
2025-05-07T20:23:07.2600729Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:07.2601159Z
2025-05-07T20:23:07.2601287Z ================================================================================
2025-05-07T20:23:07.2954378Z
2025-05-07T20:23:07.2954538Z
2025-05-07T20:23:07.2954636Z Downgraded:
2025-05-07T20:23:07.2955035Z nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:07.2955623Z nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:07.2955999Z
2025-05-07T20:23:07.2956090Z Complete!
2025-05-07T20:23:07.3395256Z + sudo systemctl restart docker
2025-05-07T20:23:11.3420890Z Wed May 7 20:23:11 2025
2025-05-07T20:23:11.3421344Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.3421874Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
2025-05-07T20:23:11.3422371Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:11.3422884Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:11.3423566Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
2025-05-07T20:23:11.3424239Z | | | MIG M. |
2025-05-07T20:23:11.3424586Z |=========================================+========================+======================|
2025-05-07T20:23:11.3503358Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
2025-05-07T20:23:11.3504711Z | 0% 29C P0 63W / 300W | 0MiB / 23028MiB | 4% Default |
2025-05-07T20:23:11.3505126Z | | | N/A |
2025-05-07T20:23:11.3505534Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:11.3505947Z
2025-05-07T20:23:11.3506508Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.3507111Z | Processes: |
2025-05-07T20:23:11.3507578Z | GPU GI CI PID Type Process name GPU Memory |
2025-05-07T20:23:11.3508237Z | ID ID Usage |
2025-05-07T20:23:11.3508606Z |=========================================================================================|
2025-05-07T20:23:11.3509053Z | No running processes found |
2025-05-07T20:23:11.3509536Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:12.2444947Z Command completed after 1 attempt(s).
2025-05-07T20:23:12.2536834Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:12.2537353Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:12.2552450Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:12.2552829Z env:
2025-05-07T20:23:12.2553084Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:12.2553408Z BUILD_ENV: build_binary
2025-05-07T20:23:12.2553784Z BUILD_TARGET: genai
2025-05-07T20:23:12.2554050Z BUILD_VARIANT: cuda
2025-05-07T20:23:12.2554305Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:12.2554592Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:12.2554928Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:12.2555282Z ##[endgroup]
2025-05-07T20:23:12.5955250Z ################################################################################
2025-05-07T20:23:12.5955615Z # Print System Info
2025-05-07T20:23:12.5955848Z #
2025-05-07T20:23:12.5970599Z # [2025-05-07T20:23:12.596Z] + print_system_info
2025-05-07T20:23:12.5970960Z ################################################################################
2025-05-07T20:23:12.5971195Z
2025-05-07T20:23:12.5971313Z ################################################################################
2025-05-07T20:23:12.5971662Z [INFO] Printing environment variables ...
2025-05-07T20:23:12.5971965Z + printenv
2025-05-07T20:23:12.5972094Z
2025-05-07T20:23:12.5991178Z SHELL=/bin/bash
2025-05-07T20:23:12.5991577Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:12.5991987Z BUILD_VARIANT=cuda
2025-05-07T20:23:12.5992518Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_c0e31a38-2415-4935-9bb9-cb592934615a
2025-05-07T20:23:12.5993117Z GITHUB_ACTION=__run
2025-05-07T20:23:12.5993424Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:12.5993892Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:12.5994191Z RUNNER_NAME=i-09c05d8e2aea2c844
2025-05-07T20:23:12.5994506Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:12.5994819Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:12.5995089Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:12.5995474Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:12.5995922Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:12.5996211Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:12.5996518Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:12.5996996Z ***
2025-05-07T20:23:12.5997207Z LOGNAME=ec2-user
2025-05-07T20:23:12.5997445Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:12.5997720Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:12.5997969Z GITHUB_ACTIONS=true
2025-05-07T20:23:12.5998197Z SYSTEMD_EXEC_PID=55565
2025-05-07T20:23:12.5998495Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:12.5999052Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:12.5999575Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:12.5999857Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:12.6000129Z RUNNER_OS=Linux
2025-05-07T20:23:12.6000363Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:12.6000616Z HOME=/home/ec2-user
2025-05-07T20:23:12.6000879Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:12.6001181Z LANG=C.UTF-8
2025-05-07T20:23:12.6001495Z RUNNER_TRACKING_ID=github_dcbe1522-ed89-4cf6-b5ae-52b4cb845c2e
2025-05-07T20:23:12.6001861Z RUNNER_ARCH=X64
2025-05-07T20:23:12.6002153Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:12.6002849Z BUILD_TARGET=genai
2025-05-07T20:23:12.6003385Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_c0e31a38-2415-4935-9bb9-cb592934615a
2025-05-07T20:23:12.6004298Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_c0e31a38-2415-4935-9bb9-cb592934615a
2025-05-07T20:23:12.6005048Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:12.6005726Z INVOCATION_ID=e7feec3cfd9b4570a4e9feb57496356b
2025-05-07T20:23:12.6006064Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:12.6006341Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:12.6006937Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_c0e31a38-2415-4935-9bb9-cb592934615a
2025-05-07T20:23:12.6007556Z BUILD_ENV=build_binary
2025-05-07T20:23:12.6007796Z GITHUB_ACTOR=q10
2025-05-07T20:23:12.6008021Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:12.6008254Z KERN_NAME_LC=linux
2025-05-07T20:23:12.6008488Z BUILD_CUDA_VERSION=12.8.0
2025-05-07T20:23:12.6008801Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:12.6009154Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:12.6009404Z USER=ec2-user
2025-05-07T20:23:12.6009647Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:12.6009939Z SHLVL=1 2025-05-07T20:23:12.6010135Z GITHUB_ACTOR_ID=255046 2025-05-07T20:23:12.6010460Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool 2025-05-07T20:23:12.6010920Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T20:23:12.6011285Z GITHUB_REF_NAME=4066/merge 2025-05-07T20:23:12.6011538Z KERN_NAME=Linux 2025-05-07T20:23:12.6011780Z GITHUB_JOB=test_and_publish_artifact 2025-05-07T20:23:12.6012196Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh 2025-05-07T20:23:12.6012636Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T20:23:12.6012925Z GITHUB_RETENTION_DAYS=90 2025-05-07T20:23:12.6013174Z JOURNAL_STREAM=8:86301 2025-05-07T20:23:12.6013514Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM 2025-05-07T20:23:12.6013946Z GITHUB_ACTION_REPOSITORY= 2025-05-07T20:23:12.6014353Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 2025-05-07T20:23:12.6014774Z GITHUB_BASE_REF=main 2025-05-07T20:23:12.6015059Z CI=true 2025-05-07T20:23:12.6015333Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T20:23:12.6015692Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T20:23:12.6016054Z GITHUB_ACTION_REF= 2025-05-07T20:23:12.6016340Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI 2025-05-07T20:23:12.6016968Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_c0e31a38-2415-4935-9bb9-cb592934615a 2025-05-07T20:23:12.6017571Z MACHINE_NAME=x86_64 2025-05-07T20:23:12.6017804Z _=/usr/bin/printenv 2025-05-07T20:23:12.6017942Z 2025-05-07T20:23:12.6018070Z ################################################################################ 2025-05-07T20:23:12.6018396Z [INFO] Print ldd version ... 2025-05-07T20:23:12.6018672Z + ldd --version 2025-05-07T20:23:12.6018805Z 2025-05-07T20:23:12.6018912Z ldd (GNU libc) 2.34 2025-05-07T20:23:12.6019187Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:23:12.6019649Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:23:12.6020201Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:23:12.6020667Z Written by Roland McGrath and Ulrich Drepper. 2025-05-07T20:23:12.6020893Z 2025-05-07T20:23:12.6021016Z ################################################################################ 2025-05-07T20:23:12.6021343Z [INFO] Print CPU info ... 
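The CPU step below runs nproc, lscpu, and cat /proc/cpuinfo in sequence. Note that nproc reports hardware threads, not physical cores: on this g5.4xlarge it prints 16, while lscpu shows 8 cores with 2 threads each. A sketch (not part of the workflow) for deriving the physical-core count directly:

    # Count unique (core, socket) pairs from lscpu's parseable output;
    # this yields 8 on this instance, whereas nproc yields 16 threads.
    lscpu --parse=Core,Socket | grep -v '^#' | sort -u | wc -l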
2025-05-07T20:23:12.6021597Z + nproc 2025-05-07T20:23:12.6021710Z 2025-05-07T20:23:12.6034729Z 16 2025-05-07T20:23:12.6036520Z 2025-05-07T20:23:12.6037178Z + lscpu 2025-05-07T20:23:12.6037343Z 2025-05-07T20:23:12.6147666Z Architecture: x86_64 2025-05-07T20:23:12.6148075Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:12.6148998Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6149412Z Byte Order: Little Endian 2025-05-07T20:23:12.6149745Z CPU(s): 16 2025-05-07T20:23:12.6150051Z On-line CPU(s) list: 0-15 2025-05-07T20:23:12.6150387Z Vendor ID: AuthenticAMD 2025-05-07T20:23:12.6150745Z Model name: AMD EPYC 7R32 2025-05-07T20:23:12.6151071Z CPU family: 23 2025-05-07T20:23:12.6151572Z Model: 49 2025-05-07T20:23:12.6151882Z Thread(s) per core: 2 2025-05-07T20:23:12.6152190Z Core(s) per socket: 8 2025-05-07T20:23:12.6152483Z Socket(s): 1 2025-05-07T20:23:12.6152775Z Stepping: 0 2025-05-07T20:23:12.6153093Z BogoMIPS: 5600.08 2025-05-07T20:23:12.6155349Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6157599Z Hypervisor vendor: KVM 2025-05-07T20:23:12.6157924Z Virtualization type: full 2025-05-07T20:23:12.6158283Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:12.6158668Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:12.6159047Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:12.6159417Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:12.6159754Z NUMA node(s): 1 2025-05-07T20:23:12.6160066Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:12.6160413Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:12.6160803Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:12.6161182Z Vulnerability L1tf: Not affected 2025-05-07T20:23:12.6161544Z Vulnerability Mds: Not affected 2025-05-07T20:23:12.6161920Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:12.6162333Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:12.6162722Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:12.6163288Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:12.6163903Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:12.6164512Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:12.6165248Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:12.6166184Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:12.6166886Z Vulnerability Srbds: Not affected 2025-05-07T20:23:12.6167279Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:12.6167612Z 2025-05-07T20:23:12.6167710Z + cat /proc/cpuinfo 2025-05-07T20:23:12.6167854Z 2025-05-07T20:23:12.6167955Z processor : 0 2025-05-07T20:23:12.6168184Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6168447Z cpu family : 23 2025-05-07T20:23:12.6168673Z model : 49 
2025-05-07T20:23:12.6168890Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6169150Z stepping : 0 2025-05-07T20:23:12.6169376Z microcode : 0x830107f 2025-05-07T20:23:12.6169712Z cpu MHz : 3100.881 2025-05-07T20:23:12.6169941Z cache size : 512 KB 2025-05-07T20:23:12.6170170Z physical id : 0 2025-05-07T20:23:12.6170384Z siblings : 16 2025-05-07T20:23:12.6170597Z core id : 0 2025-05-07T20:23:12.6170809Z cpu cores : 8 2025-05-07T20:23:12.6171015Z apicid : 0 2025-05-07T20:23:12.6171230Z initial apicid : 0 2025-05-07T20:23:12.6171459Z fpu : yes 2025-05-07T20:23:12.6171663Z fpu_exception : yes 2025-05-07T20:23:12.6171891Z cpuid level : 13 2025-05-07T20:23:12.6172112Z wp : yes 2025-05-07T20:23:12.6174249Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6176563Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6177068Z bogomips : 5600.08 2025-05-07T20:23:12.6177304Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6177557Z clflush size : 64 2025-05-07T20:23:12.6177782Z cache_alignment : 64 2025-05-07T20:23:12.6178073Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6178411Z power management: 2025-05-07T20:23:12.6178553Z 2025-05-07T20:23:12.6178640Z processor : 1 2025-05-07T20:23:12.6178871Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6179127Z cpu family : 23 2025-05-07T20:23:12.6179347Z model : 49 2025-05-07T20:23:12.6179563Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6179820Z stepping : 0 2025-05-07T20:23:12.6180049Z microcode : 0x830107f 2025-05-07T20:23:12.6180286Z cpu MHz : 3073.173 2025-05-07T20:23:12.6180515Z cache size : 512 KB 2025-05-07T20:23:12.6180753Z physical id : 0 2025-05-07T20:23:12.6180970Z siblings : 16 2025-05-07T20:23:12.6181187Z core id : 1 2025-05-07T20:23:12.6181401Z cpu cores : 8 2025-05-07T20:23:12.6181610Z apicid : 2 2025-05-07T20:23:12.6181822Z initial apicid : 2 2025-05-07T20:23:12.6182047Z fpu : yes 2025-05-07T20:23:12.6182253Z fpu_exception : yes 2025-05-07T20:23:12.6182483Z cpuid level : 13 2025-05-07T20:23:12.6182705Z wp : yes 2025-05-07T20:23:12.6184722Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6187032Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6187549Z bogomips : 5600.08 2025-05-07T20:23:12.6187784Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6188028Z clflush size : 64 
2025-05-07T20:23:12.6188260Z cache_alignment : 64 2025-05-07T20:23:12.6188547Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6188883Z power management: 2025-05-07T20:23:12.6189020Z 2025-05-07T20:23:12.6189113Z processor : 2 2025-05-07T20:23:12.6189343Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6189597Z cpu family : 23 2025-05-07T20:23:12.6189807Z model : 49 2025-05-07T20:23:12.6190030Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6190286Z stepping : 0 2025-05-07T20:23:12.6190503Z microcode : 0x830107f 2025-05-07T20:23:12.6190741Z cpu MHz : 3265.151 2025-05-07T20:23:12.6190967Z cache size : 512 KB 2025-05-07T20:23:12.6191188Z physical id : 0 2025-05-07T20:23:12.6191409Z siblings : 16 2025-05-07T20:23:12.6191714Z core id : 2 2025-05-07T20:23:12.6191918Z cpu cores : 8 2025-05-07T20:23:12.6192136Z apicid : 4 2025-05-07T20:23:12.6192351Z initial apicid : 4 2025-05-07T20:23:12.6192573Z fpu : yes 2025-05-07T20:23:12.6192788Z fpu_exception : yes 2025-05-07T20:23:12.6193017Z cpuid level : 13 2025-05-07T20:23:12.6193234Z wp : yes 2025-05-07T20:23:12.6195398Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6197698Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6198217Z bogomips : 5600.08 2025-05-07T20:23:12.6198444Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6198697Z clflush size : 64 2025-05-07T20:23:12.6198932Z cache_alignment : 64 2025-05-07T20:23:12.6199224Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6199550Z power management: 2025-05-07T20:23:12.6199696Z 2025-05-07T20:23:12.6199784Z processor : 3 2025-05-07T20:23:12.6200012Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6200264Z cpu family : 23 2025-05-07T20:23:12.6200482Z model : 49 2025-05-07T20:23:12.6200702Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6200949Z stepping : 0 2025-05-07T20:23:12.6201170Z microcode : 0x830107f 2025-05-07T20:23:12.6201410Z cpu MHz : 3094.769 2025-05-07T20:23:12.6201630Z cache size : 512 KB 2025-05-07T20:23:12.6201856Z physical id : 0 2025-05-07T20:23:12.6202079Z siblings : 16 2025-05-07T20:23:12.6202287Z core id : 3 2025-05-07T20:23:12.6202504Z cpu cores : 8 2025-05-07T20:23:12.6202730Z apicid : 6 2025-05-07T20:23:12.6202941Z initial apicid : 6 2025-05-07T20:23:12.6203159Z fpu : yes 2025-05-07T20:23:12.6203372Z fpu_exception : yes 2025-05-07T20:23:12.6203609Z cpuid level : 13 2025-05-07T20:23:12.6203824Z wp : yes 2025-05-07T20:23:12.6205833Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6208127Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6208646Z bogomips : 5600.08 2025-05-07T20:23:12.6208882Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6209126Z clflush size : 64 2025-05-07T20:23:12.6219295Z cache_alignment : 64 2025-05-07T20:23:12.6219638Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6219978Z power management: 2025-05-07T20:23:12.6220132Z 2025-05-07T20:23:12.6220223Z processor : 4 2025-05-07T20:23:12.6220461Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6220715Z cpu family : 23 2025-05-07T20:23:12.6220940Z model : 49 2025-05-07T20:23:12.6221174Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6221437Z stepping : 0 2025-05-07T20:23:12.6221660Z microcode : 0x830107f 2025-05-07T20:23:12.6221915Z cpu MHz : 3259.383 2025-05-07T20:23:12.6222150Z cache size : 512 KB 2025-05-07T20:23:12.6222377Z physical id : 0 2025-05-07T20:23:12.6222604Z siblings : 16 2025-05-07T20:23:12.6222824Z core id : 4 2025-05-07T20:23:12.6223037Z cpu cores : 8 2025-05-07T20:23:12.6223253Z apicid : 8 2025-05-07T20:23:12.6223620Z initial apicid : 8 2025-05-07T20:23:12.6224097Z fpu : yes 2025-05-07T20:23:12.6224405Z fpu_exception : yes 2025-05-07T20:23:12.6224722Z cpuid level : 13 2025-05-07T20:23:12.6224995Z wp : yes 2025-05-07T20:23:12.6227199Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6229493Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6230010Z bogomips : 5600.08 2025-05-07T20:23:12.6230254Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6230513Z clflush size : 64 2025-05-07T20:23:12.6230749Z cache_alignment : 64 2025-05-07T20:23:12.6231035Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6231380Z power management: 2025-05-07T20:23:12.6231532Z 2025-05-07T20:23:12.6231622Z processor : 5 2025-05-07T20:23:12.6231858Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6232112Z cpu family : 23 2025-05-07T20:23:12.6232338Z model : 49 2025-05-07T20:23:12.6232563Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6232826Z stepping : 0 2025-05-07T20:23:12.6233053Z microcode : 0x830107f 2025-05-07T20:23:12.6233297Z cpu MHz : 2906.026 2025-05-07T20:23:12.6233587Z cache size : 512 KB 2025-05-07T20:23:12.6233820Z physical id : 0 2025-05-07T20:23:12.6234043Z siblings : 16 2025-05-07T20:23:12.6234254Z core id : 5 2025-05-07T20:23:12.6234465Z cpu cores : 8 2025-05-07T20:23:12.6234679Z apicid : 10 2025-05-07T20:23:12.6234893Z initial apicid : 10 2025-05-07T20:23:12.6235121Z fpu : yes 2025-05-07T20:23:12.6235345Z fpu_exception : yes 2025-05-07T20:23:12.6235570Z cpuid level : 13 2025-05-07T20:23:12.6235799Z wp : yes 2025-05-07T20:23:12.6237806Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6240096Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6240613Z bogomips : 5600.08 2025-05-07T20:23:12.6240845Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6241097Z clflush size : 64 2025-05-07T20:23:12.6241338Z cache_alignment : 64 2025-05-07T20:23:12.6241623Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6241957Z power management: 2025-05-07T20:23:12.6242097Z 2025-05-07T20:23:12.6242194Z processor : 6 2025-05-07T20:23:12.6242420Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6242678Z cpu family : 23 2025-05-07T20:23:12.6242901Z model : 49 2025-05-07T20:23:12.6243123Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6243384Z stepping : 0 2025-05-07T20:23:12.6243615Z microcode : 0x830107f 2025-05-07T20:23:12.6243858Z cpu MHz : 3205.385 2025-05-07T20:23:12.6244091Z cache size : 512 KB 2025-05-07T20:23:12.6244325Z physical id : 0 2025-05-07T20:23:12.6244546Z siblings : 16 2025-05-07T20:23:12.6244768Z core id : 6 2025-05-07T20:23:12.6244986Z cpu cores : 8 2025-05-07T20:23:12.6245197Z apicid : 12 2025-05-07T20:23:12.6245420Z initial apicid : 12 2025-05-07T20:23:12.6245647Z fpu : yes 2025-05-07T20:23:12.6245853Z fpu_exception : yes 2025-05-07T20:23:12.6246086Z cpuid level : 13 2025-05-07T20:23:12.6246445Z wp : yes 2025-05-07T20:23:12.6248533Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6250813Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6251318Z bogomips : 5600.08 2025-05-07T20:23:12.6251555Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6251805Z clflush size : 64 2025-05-07T20:23:12.6252033Z cache_alignment : 64 2025-05-07T20:23:12.6252333Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6252672Z power management: 2025-05-07T20:23:12.6252811Z 2025-05-07T20:23:12.6252900Z processor : 7 2025-05-07T20:23:12.6253134Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6253388Z cpu family : 23 2025-05-07T20:23:12.6253604Z model : 49 2025-05-07T20:23:12.6253828Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6254084Z stepping : 0 2025-05-07T20:23:12.6254309Z microcode : 0x830107f 2025-05-07T20:23:12.6254543Z cpu MHz : 3270.817 2025-05-07T20:23:12.6254781Z cache size : 512 KB 2025-05-07T20:23:12.6255010Z physical id : 0 2025-05-07T20:23:12.6255232Z siblings : 16 2025-05-07T20:23:12.6255446Z core id : 7 2025-05-07T20:23:12.6255660Z cpu cores : 8 2025-05-07T20:23:12.6255868Z apicid : 
14 2025-05-07T20:23:12.6256092Z initial apicid : 14 2025-05-07T20:23:12.6256320Z fpu : yes 2025-05-07T20:23:12.6256530Z fpu_exception : yes 2025-05-07T20:23:12.6256761Z cpuid level : 13 2025-05-07T20:23:12.6256984Z wp : yes 2025-05-07T20:23:12.6259007Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6261304Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6261816Z bogomips : 5600.08 2025-05-07T20:23:12.6262054Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6262302Z clflush size : 64 2025-05-07T20:23:12.6262536Z cache_alignment : 64 2025-05-07T20:23:12.6262817Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6263161Z power management: 2025-05-07T20:23:12.6263299Z 2025-05-07T20:23:12.6263396Z processor : 8 2025-05-07T20:23:12.6263636Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6263913Z cpu family : 23 2025-05-07T20:23:12.6264156Z model : 49 2025-05-07T20:23:12.6264381Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6264644Z stepping : 0 2025-05-07T20:23:12.6264872Z microcode : 0x830107f 2025-05-07T20:23:12.6265109Z cpu MHz : 3163.691 2025-05-07T20:23:12.6265344Z cache size : 512 KB 2025-05-07T20:23:12.6265578Z physical id : 0 2025-05-07T20:23:12.6265814Z siblings : 16 2025-05-07T20:23:12.6266027Z core id : 0 2025-05-07T20:23:12.6266250Z cpu cores : 8 2025-05-07T20:23:12.6266470Z apicid : 1 2025-05-07T20:23:12.6266680Z initial apicid : 1 2025-05-07T20:23:12.6266917Z fpu : yes 2025-05-07T20:23:12.6267136Z fpu_exception : yes 2025-05-07T20:23:12.6267364Z cpuid level : 13 2025-05-07T20:23:12.6267592Z wp : yes 2025-05-07T20:23:12.6269605Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6272071Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6272580Z bogomips : 5600.08 2025-05-07T20:23:12.6272822Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6273077Z clflush size : 64 2025-05-07T20:23:12.6273309Z cache_alignment : 64 2025-05-07T20:23:12.6273664Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6274009Z power management: 2025-05-07T20:23:12.6274151Z 2025-05-07T20:23:12.6274255Z processor : 9 2025-05-07T20:23:12.6274482Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6274740Z cpu family : 23 2025-05-07T20:23:12.6274963Z model : 49 2025-05-07T20:23:12.6275180Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6275442Z 
stepping : 0 2025-05-07T20:23:12.6275666Z microcode : 0x830107f 2025-05-07T20:23:12.6275906Z cpu MHz : 2976.806 2025-05-07T20:23:12.6276139Z cache size : 512 KB 2025-05-07T20:23:12.6276430Z physical id : 0 2025-05-07T20:23:12.6276663Z siblings : 16 2025-05-07T20:23:12.6276899Z core id : 1 2025-05-07T20:23:12.6277136Z cpu cores : 8 2025-05-07T20:23:12.6277350Z apicid : 3 2025-05-07T20:23:12.6277571Z initial apicid : 3 2025-05-07T20:23:12.6277801Z fpu : yes 2025-05-07T20:23:12.6278010Z fpu_exception : yes 2025-05-07T20:23:12.6278246Z cpuid level : 13 2025-05-07T20:23:12.6278469Z wp : yes 2025-05-07T20:23:12.6280480Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6282777Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6283291Z bogomips : 5600.08 2025-05-07T20:23:12.6283531Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6283787Z clflush size : 64 2025-05-07T20:23:12.6284017Z cache_alignment : 64 2025-05-07T20:23:12.6284313Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6284650Z power management: 2025-05-07T20:23:12.6284792Z 2025-05-07T20:23:12.6284883Z processor : 10 2025-05-07T20:23:12.6285123Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6285393Z cpu family : 23 2025-05-07T20:23:12.6285604Z model : 49 2025-05-07T20:23:12.6285826Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6286083Z stepping : 0 2025-05-07T20:23:12.6286296Z microcode : 0x830107f 2025-05-07T20:23:12.6286541Z cpu MHz : 3114.815 2025-05-07T20:23:12.6286767Z cache size : 512 KB 2025-05-07T20:23:12.6286990Z physical id : 0 2025-05-07T20:23:12.6287212Z siblings : 16 2025-05-07T20:23:12.6287424Z core id : 2 2025-05-07T20:23:12.6287626Z cpu cores : 8 2025-05-07T20:23:12.6287838Z apicid : 5 2025-05-07T20:23:12.6288056Z initial apicid : 5 2025-05-07T20:23:12.6288274Z fpu : yes 2025-05-07T20:23:12.6288485Z fpu_exception : yes 2025-05-07T20:23:12.6288713Z cpuid level : 13 2025-05-07T20:23:12.6288926Z wp : yes 2025-05-07T20:23:12.6290927Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6293298Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6293805Z bogomips : 5600.08 2025-05-07T20:23:12.6294145Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6294394Z clflush size : 64 2025-05-07T20:23:12.6294622Z cache_alignment : 64 2025-05-07T20:23:12.6294909Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:12.6295233Z power management: 2025-05-07T20:23:12.6295376Z 2025-05-07T20:23:12.6295464Z processor : 11 2025-05-07T20:23:12.6295695Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6295945Z cpu family : 23 2025-05-07T20:23:12.6296167Z model : 49 2025-05-07T20:23:12.6296396Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6296646Z stepping : 0 2025-05-07T20:23:12.6296869Z microcode : 0x830107f 2025-05-07T20:23:12.6297110Z cpu MHz : 3207.896 2025-05-07T20:23:12.6297336Z cache size : 512 KB 2025-05-07T20:23:12.6297559Z physical id : 0 2025-05-07T20:23:12.6297782Z siblings : 16 2025-05-07T20:23:12.6297999Z core id : 3 2025-05-07T20:23:12.6298205Z cpu cores : 8 2025-05-07T20:23:12.6298417Z apicid : 7 2025-05-07T20:23:12.6298629Z initial apicid : 7 2025-05-07T20:23:12.6298850Z fpu : yes 2025-05-07T20:23:12.6299061Z fpu_exception : yes 2025-05-07T20:23:12.6299288Z cpuid level : 13 2025-05-07T20:23:12.6299500Z wp : yes 2025-05-07T20:23:12.6301504Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6303803Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6304316Z bogomips : 5600.08 2025-05-07T20:23:12.6304540Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6304797Z clflush size : 64 2025-05-07T20:23:12.6305029Z cache_alignment : 64 2025-05-07T20:23:12.6305308Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6305637Z power management: 2025-05-07T20:23:12.6305780Z 2025-05-07T20:23:12.6305871Z processor : 12 2025-05-07T20:23:12.6306097Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6306345Z cpu family : 23 2025-05-07T20:23:12.6306564Z model : 49 2025-05-07T20:23:12.6306779Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6307034Z stepping : 0 2025-05-07T20:23:12.6307252Z microcode : 0x830107f 2025-05-07T20:23:12.6307488Z cpu MHz : 3272.436 2025-05-07T20:23:12.6307706Z cache size : 512 KB 2025-05-07T20:23:12.6307935Z physical id : 0 2025-05-07T20:23:12.6308153Z siblings : 16 2025-05-07T20:23:12.6308358Z core id : 4 2025-05-07T20:23:12.6308567Z cpu cores : 8 2025-05-07T20:23:12.6308780Z apicid : 9 2025-05-07T20:23:12.6308982Z initial apicid : 9 2025-05-07T20:23:12.6309210Z fpu : yes 2025-05-07T20:23:12.6309424Z fpu_exception : yes 2025-05-07T20:23:12.6309650Z cpuid level : 13 2025-05-07T20:23:12.6309869Z wp : yes 2025-05-07T20:23:12.6311871Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:12.6314343Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6314853Z bogomips : 5600.08 2025-05-07T20:23:12.6315078Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6315325Z clflush size : 64 2025-05-07T20:23:12.6315555Z cache_alignment : 64 2025-05-07T20:23:12.6315926Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6316258Z power management: 2025-05-07T20:23:12.6316397Z 2025-05-07T20:23:12.6316489Z processor : 13 2025-05-07T20:23:12.6316715Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6316966Z cpu family : 23 2025-05-07T20:23:12.6317182Z model : 49 2025-05-07T20:23:12.6317395Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6317655Z stepping : 0 2025-05-07T20:23:12.6317873Z microcode : 0x830107f 2025-05-07T20:23:12.6318110Z cpu MHz : 3213.101 2025-05-07T20:23:12.6318335Z cache size : 512 KB 2025-05-07T20:23:12.6318560Z physical id : 0 2025-05-07T20:23:12.6318772Z siblings : 16 2025-05-07T20:23:12.6318986Z core id : 5 2025-05-07T20:23:12.6319193Z cpu cores : 8 2025-05-07T20:23:12.6319397Z apicid : 11 2025-05-07T20:23:12.6319614Z initial apicid : 11 2025-05-07T20:23:12.6319837Z fpu : yes 2025-05-07T20:23:12.6320042Z fpu_exception : yes 2025-05-07T20:23:12.6320271Z cpuid level : 13 2025-05-07T20:23:12.6320490Z wp : yes 2025-05-07T20:23:12.6322508Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6325076Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6325581Z bogomips : 5600.08 2025-05-07T20:23:12.6325813Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6326061Z clflush size : 64 2025-05-07T20:23:12.6326284Z cache_alignment : 64 2025-05-07T20:23:12.6326568Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6326915Z power management: 2025-05-07T20:23:12.6327052Z 2025-05-07T20:23:12.6327141Z processor : 14 2025-05-07T20:23:12.6327372Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6327622Z cpu family : 23 2025-05-07T20:23:12.6327831Z model : 49 2025-05-07T20:23:12.6328046Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6328296Z stepping : 0 2025-05-07T20:23:12.6328508Z microcode : 0x830107f 2025-05-07T20:23:12.6328745Z cpu MHz : 3276.735 2025-05-07T20:23:12.6328976Z cache size : 512 KB 2025-05-07T20:23:12.6329198Z physical id : 0 2025-05-07T20:23:12.6329418Z siblings : 16 2025-05-07T20:23:12.6329628Z core id : 6 2025-05-07T20:23:12.6329833Z cpu cores : 8 2025-05-07T20:23:12.6330047Z apicid : 13 2025-05-07T20:23:12.6330263Z initial apicid : 13 2025-05-07T20:23:12.6330481Z fpu : yes 2025-05-07T20:23:12.6330690Z fpu_exception : yes 2025-05-07T20:23:12.6330919Z cpuid level : 13 2025-05-07T20:23:12.6331132Z wp : yes 2025-05-07T20:23:12.6333154Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6337320Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6337836Z bogomips : 5600.08 2025-05-07T20:23:12.6338067Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6338318Z clflush size : 64 2025-05-07T20:23:12.6338551Z cache_alignment : 64 2025-05-07T20:23:12.6338838Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6339163Z power management: 2025-05-07T20:23:12.6339304Z 2025-05-07T20:23:12.6339516Z processor : 15 2025-05-07T20:23:12.6339754Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.6340019Z cpu family : 23 2025-05-07T20:23:12.6340268Z model : 49 2025-05-07T20:23:12.6340496Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.6340747Z stepping : 0 2025-05-07T20:23:12.6340973Z microcode : 0x830107f 2025-05-07T20:23:12.6341216Z cpu MHz : 3253.540 2025-05-07T20:23:12.6341442Z cache size : 512 KB 2025-05-07T20:23:12.6341677Z physical id : 0 2025-05-07T20:23:12.6341914Z siblings : 16 2025-05-07T20:23:12.6342129Z core id : 7 2025-05-07T20:23:12.6342347Z cpu cores : 8 2025-05-07T20:23:12.6342564Z apicid : 15 2025-05-07T20:23:12.6342782Z initial apicid : 15 2025-05-07T20:23:12.6343017Z fpu : yes 2025-05-07T20:23:12.6343243Z fpu_exception : yes 2025-05-07T20:23:12.6343473Z cpuid level : 13 2025-05-07T20:23:12.6343700Z wp : yes 2025-05-07T20:23:12.6345724Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.6348016Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.6348531Z bogomips : 5600.08 2025-05-07T20:23:12.6348762Z TLB size : 3072 4K pages 2025-05-07T20:23:12.6349013Z clflush size : 64 2025-05-07T20:23:12.6349248Z cache_alignment : 64 2025-05-07T20:23:12.6349537Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.6349877Z power management: 2025-05-07T20:23:12.6350017Z 2025-05-07T20:23:12.6350021Z 2025-05-07T20:23:12.6350158Z ################################################################################ 2025-05-07T20:23:12.6350482Z [INFO] Print PCI info ... 2025-05-07T20:23:12.6350780Z + lspci -v 2025-05-07T20:23:12.6350913Z 2025-05-07T20:23:12.6351103Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:12.6351512Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:12.6351858Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:12.6352079Z 2025-05-07T20:23:12.6352291Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:12.6352694Z Physical Slot: 1 2025-05-07T20:23:12.6352952Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.6353164Z 2025-05-07T20:23:12.6353421Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:12.6353957Z Physical Slot: 1 2025-05-07T20:23:12.6354227Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:12.6354461Z 2025-05-07T20:23:12.6354743Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:12.6355200Z Physical Slot: 3 2025-05-07T20:23:12.6355454Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.6355813Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.6356182Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:12.6356420Z 2025-05-07T20:23:12.6356731Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.6357347Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:12.6357647Z Physical Slot: 4 2025-05-07T20:23:12.6357913Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:12.6358314Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.6358681Z Capabilities: 2025-05-07T20:23:12.6358958Z Kernel driver in use: nvme 2025-05-07T20:23:12.6359134Z 2025-05-07T20:23:12.6359444Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.6359945Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.6360302Z Physical Slot: 5 2025-05-07T20:23:12.6360560Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.6360940Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.6361337Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.6361675Z Capabilities: 2025-05-07T20:23:12.6361961Z Kernel driver in use: ena 2025-05-07T20:23:12.6362221Z Kernel modules: ena 2025-05-07T20:23:12.6362367Z 2025-05-07T20:23:12.6362559Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:12.6362964Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:12.6363266Z Physical Slot: 30 2025-05-07T20:23:12.6363541Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:12.6363939Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:12.6364350Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:12.6364743Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:12.6365090Z Capabilities: 2025-05-07T20:23:12.6365369Z Kernel driver in use: nvidia 2025-05-07T20:23:12.6365643Z Kernel modules: nvidia 2025-05-07T20:23:12.6365797Z 2025-05-07T20:23:12.6366113Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.6366656Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:12.6367055Z Physical Slot: 31 2025-05-07T20:23:12.6367412Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.6367848Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.6368534Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:12.6369006Z Capabilities: 2025-05-07T20:23:12.6379698Z Kernel driver in use: nvme 2025-05-07T20:23:12.6379901Z 2025-05-07T20:23:12.6379905Z 2025-05-07T20:23:12.6380042Z ################################################################################ 2025-05-07T20:23:12.6380403Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:12.6380710Z + uname -a 2025-05-07T20:23:12.6380842Z 2025-05-07T20:23:12.6381272Z Linux ip-10-0-65-139.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:12.6381801Z 2025-05-07T20:23:12.6381888Z + uname -m 2025-05-07T20:23:12.6382015Z 2025-05-07T20:23:12.6382108Z x86_64 2025-05-07T20:23:12.6382223Z 2025-05-07T20:23:12.6382317Z + cat /proc/version 2025-05-07T20:23:12.6382470Z 2025-05-07T20:23:12.6383031Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:12.6383682Z 2025-05-07T20:23:12.6383777Z + cat /etc/os-release 2025-05-07T20:23:12.6383933Z 2025-05-07T20:23:12.6384053Z NAME="Amazon Linux" 2025-05-07T20:23:12.6384284Z VERSION="2023" 2025-05-07T20:23:12.6384509Z ID="amzn" 2025-05-07T20:23:12.6384719Z ID_LIKE="fedora" 2025-05-07T20:23:12.6384939Z VERSION_ID="2023" 2025-05-07T20:23:12.6385193Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:12.6385500Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:12.6385805Z ANSI_COLOR="0;33" 2025-05-07T20:23:12.6386079Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:12.6386625Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:12.6387085Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:12.6387523Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:12.6387993Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:12.6388392Z VENDOR_NAME="AWS" 2025-05-07T20:23:12.6388648Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:12.6388961Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:12.6389126Z 2025-05-07T20:23:12.6389353Z ################################################################################ 2025-05-07T20:23:12.6389685Z # Print EC2 Instance Info 2025-05-07T20:23:12.6389940Z # 2025-05-07T20:23:12.6390167Z # [2025-05-07T20:23:12.637Z] + print_ec2_info 2025-05-07T20:23:12.6390505Z ################################################################################ 2025-05-07T20:23:12.6390733Z 2025-05-07T20:23:12.6495034Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:12.6615820Z instance-id: i-09c05d8e2aea2c844 2025-05-07T20:23:12.6731058Z instance-type: g5.4xlarge 2025-05-07T20:23:12.6771202Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:12.6771584Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:12.6780507Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:12.6780887Z env: 2025-05-07T20:23:12.6781130Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:12.6781460Z BUILD_ENV: build_binary 2025-05-07T20:23:12.6781729Z BUILD_TARGET: genai 2025-05-07T20:23:12.6781980Z BUILD_VARIANT: cuda 2025-05-07T20:23:12.6782242Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:23:12.6782525Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:12.6782859Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:12.6783218Z ##[endgroup] 2025-05-07T20:23:13.0152700Z ################################################################################ 2025-05-07T20:23:13.0153097Z [INFO] Printing general display info ... 2025-05-07T20:23:13.0181524Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:13.1329376Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:13.1338318Z /usr/bin/sudo 2025-05-07T20:23:13.1349043Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:13.1359956Z /usr/bin/yum 2025-05-07T20:23:13.1361604Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:13.1382068Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:13.5961711Z Last metadata expiration check: 0:00:09 ago on Wed May 7 20:23:04 2025. 2025-05-07T20:23:13.6679902Z ================================================================================ 2025-05-07T20:23:13.6680431Z WARNING: 2025-05-07T20:23:13.6680778Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:13.6681122Z 2025-05-07T20:23:13.6681256Z Available Versions: 2025-05-07T20:23:13.6681468Z 2025-05-07T20:23:13.6681612Z Version 2023.7.20250331: 2025-05-07T20:23:13.6681958Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:13.6682248Z 2025-05-07T20:23:13.6682390Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:13.6682617Z 2025-05-07T20:23:13.6682710Z Release notes: 2025-05-07T20:23:13.6683138Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:13.6683516Z 2025-05-07T20:23:13.6683612Z Version 2023.7.20250414: 2025-05-07T20:23:13.6683940Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:13.6684204Z 2025-05-07T20:23:13.6684324Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:13.6684540Z 2025-05-07T20:23:13.6684636Z Release notes: 2025-05-07T20:23:13.6685038Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:13.6685415Z 2025-05-07T20:23:13.6685508Z Version 2023.7.20250428: 2025-05-07T20:23:13.6685828Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:13.6686319Z 2025-05-07T20:23:13.6686452Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:13.6686671Z 2025-05-07T20:23:13.6686763Z Release notes: 2025-05-07T20:23:13.6687167Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:13.6687538Z 2025-05-07T20:23:13.6687663Z ================================================================================ 2025-05-07T20:23:13.7855113Z Dependencies resolved. 
2025-05-07T20:23:13.8143802Z ================================================================================ 2025-05-07T20:23:13.8144435Z Package Arch Version Repository Size 2025-05-07T20:23:13.8144945Z ================================================================================ 2025-05-07T20:23:13.8145303Z Upgrading: 2025-05-07T20:23:13.8145732Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:13.8146342Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:13.8146843Z 2025-05-07T20:23:13.8147356Z Transaction Summary 2025-05-07T20:23:13.8147730Z ================================================================================ 2025-05-07T20:23:13.8148182Z Upgrade 2 Packages 2025-05-07T20:23:13.8148396Z 2025-05-07T20:23:13.8148548Z Total download size: 6.9 M 2025-05-07T20:23:13.8148920Z Downloading Packages: 2025-05-07T20:23:13.8581454Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 29 MB/s | 1.2 MB 00:00 2025-05-07T20:23:13.8989398Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 68 MB/s | 5.7 MB 00:00 2025-05-07T20:23:13.8998357Z -------------------------------------------------------------------------------- 2025-05-07T20:23:13.9001593Z Total 81 MB/s | 6.9 MB 00:00 2025-05-07T20:23:13.9004239Z Running transaction check 2025-05-07T20:23:13.9103600Z Transaction check succeeded. 2025-05-07T20:23:13.9104217Z Running transaction test 2025-05-07T20:23:13.9399753Z Transaction test succeeded. 2025-05-07T20:23:13.9403106Z Running transaction 2025-05-07T20:23:14.4963533Z Preparing : 1/1 2025-05-07T20:23:14.6019594Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:14.6037695Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.6231816Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.6232622Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:14.6332899Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:14.6353956Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:14.7820705Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:14.7821539Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:14.7822273Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:14.7822837Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4
2025-05-07T20:23:14.9793616Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:14.9794100Z 2025-05-07T20:23:14.9794224Z Upgraded: 2025-05-07T20:23:14.9794727Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:14.9795343Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:14.9795744Z 2025-05-07T20:23:14.9795830Z Complete! 2025-05-07T20:23:15.0238515Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:15.0263353Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:15.5594146Z Last metadata expiration check: 0:00:11 ago on Wed May 7 20:23:04 2025. 2025-05-07T20:23:15.5829944Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:15.6233842Z Dependencies resolved.
2025-05-07T20:23:15.6411067Z ================================================================================ 2025-05-07T20:23:15.6412057Z Package Architecture Version Repository Size 2025-05-07T20:23:15.6412921Z ================================================================================ 2025-05-07T20:23:15.6413535Z Installing: 2025-05-07T20:23:15.6414132Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k 2025-05-07T20:23:15.6414673Z 2025-05-07T20:23:15.6414770Z Transaction Summary 2025-05-07T20:23:15.6415079Z ================================================================================ 2025-05-07T20:23:15.6415399Z Install 1 Package 2025-05-07T20:23:15.6415540Z 2025-05-07T20:23:15.6415656Z Total download size: 319 k 2025-05-07T20:23:15.6415917Z Installed size: 837 k 2025-05-07T20:23:15.6416166Z Downloading Packages: 2025-05-07T20:23:15.7136741Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 7.6 MB/s | 319 kB 00:00 2025-05-07T20:23:15.7142539Z -------------------------------------------------------------------------------- 2025-05-07T20:23:15.7145005Z Total 4.3 MB/s | 319 kB 00:00 2025-05-07T20:23:15.7299741Z Running transaction check 2025-05-07T20:23:15.7355745Z Transaction check succeeded. 2025-05-07T20:23:15.7356450Z Running transaction test 2025-05-07T20:23:15.7816597Z Transaction test succeeded. 2025-05-07T20:23:15.7819917Z Running transaction 2025-05-07T20:23:15.8856089Z Preparing : 1/1 2025-05-07T20:23:15.9361779Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:16.1405615Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1
2025-05-07T20:23:16.3092999Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:16.3093349Z 2025-05-07T20:23:16.3093440Z Installed: 2025-05-07T20:23:16.3093767Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64 2025-05-07T20:23:16.3094067Z 2025-05-07T20:23:16.3094158Z Complete! 2025-05-07T20:23:16.3552244Z + hostname 2025-05-07T20:23:16.3552454Z 2025-05-07T20:23:16.3566991Z ip-10-0-65-139.ec2.internal 2025-05-07T20:23:16.3568655Z 2025-05-07T20:23:16.3569141Z + sudo lshw -C display 2025-05-07T20:23:16.3569374Z 2025-05-07T20:23:16.7835033Z *-display:0 UNCLAIMED 2025-05-07T20:23:16.7835529Z description: VGA compatible controller 2025-05-07T20:23:16.7835984Z product: Amazon.com, Inc. 2025-05-07T20:23:16.7836390Z vendor: Amazon.com, Inc.
2025-05-07T20:23:16.7836748Z physical id: 3 2025-05-07T20:23:16.7837077Z bus info: pci@0000:00:03.0 2025-05-07T20:23:16.7837423Z version: 00 2025-05-07T20:23:16.7837718Z width: 32 bits 2025-05-07T20:23:16.7837978Z clock: 33MHz 2025-05-07T20:23:16.7838241Z capabilities: vga_controller bus_master 2025-05-07T20:23:16.7838574Z configuration: latency=0 2025-05-07T20:23:16.7838913Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:16.7839253Z *-display:1 2025-05-07T20:23:16.7839517Z description: 3D controller 2025-05-07T20:23:16.7839815Z product: GA102GL [A10G] 2025-05-07T20:23:16.7840086Z vendor: NVIDIA Corporation 2025-05-07T20:23:16.7840367Z physical id: 1e 2025-05-07T20:23:16.7840615Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:16.7840876Z version: a1 2025-05-07T20:23:16.7841102Z width: 64 bits 2025-05-07T20:23:16.7841335Z clock: 33MHz 2025-05-07T20:23:16.7841639Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:16.7842022Z configuration: driver=nvidia latency=0 2025-05-07T20:23:16.7842665Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:16.7875286Z 2025-05-07T20:23:16.7875588Z ################################################################################ 2025-05-07T20:23:16.7876074Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:16.8004532Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:16.8168986Z Wed May 7 20:23:16 2025 2025-05-07T20:23:16.8169575Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.8170118Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:16.8170625Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:16.8171134Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:16.8171676Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:16.8172115Z | | | MIG M. | 2025-05-07T20:23:16.8172465Z |=========================================+========================+======================| 2025-05-07T20:23:16.8246586Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:16.8247454Z | 0% 30C P0 58W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:16.8247850Z | | | N/A | 2025-05-07T20:23:16.8248257Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:16.8248662Z 2025-05-07T20:23:16.8249066Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.8249501Z | Processes: | 2025-05-07T20:23:16.8249951Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:16.8250374Z | ID ID Usage | 2025-05-07T20:23:16.8250747Z |=========================================================================================| 2025-05-07T20:23:16.8251635Z | No running processes found | 2025-05-07T20:23:16.8252111Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:16.9658731Z ################################################################################ 2025-05-07T20:23:16.9659121Z [INFO] Printing AMD GPU info ... 
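On this CUDA runner, the AMD probe below only reports whether the ROCm tooling exists on PATH. A sketch of an equivalent check, assuming the same [CHECK] message format as the log output that follows:

    # Probe for ROCm tools; print their output if present, otherwise note
    # their absence. Both checks fail on this NVIDIA A10G instance.
    for tool in rocminfo rocm-smi; do
      if command -v "${tool}" >/dev/null 2>&1; then
        "${tool}"
      else
        echo "[CHECK] ${tool} not found"
      fi
    done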
2025-05-07T20:23:16.9658731Z ################################################################################
2025-05-07T20:23:16.9659121Z [INFO] Printing AMD GPU info ...
2025-05-07T20:23:16.9805223Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:16.9806026Z [CHECK] rocminfo not found
2025-05-07T20:23:16.9814497Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:16.9815581Z [CHECK] rocm-smi not found
2025-05-07T20:23:16.9861643Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:16.9862111Z . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:16.9874285Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:16.9874655Z env:
2025-05-07T20:23:16.9874894Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:16.9875221Z   BUILD_ENV: build_binary
2025-05-07T20:23:16.9875535Z   BUILD_TARGET: genai
2025-05-07T20:23:16.9875775Z   BUILD_VARIANT: cuda
2025-05-07T20:23:16.9876027Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:16.9876303Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:16.9876619Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:16.9876970Z ##[endgroup]
2025-05-07T20:23:17.3278071Z ################################################################################
2025-05-07T20:23:17.3278443Z # Setup Miniconda
2025-05-07T20:23:17.3278671Z #
2025-05-07T20:23:17.3295569Z # [2025-05-07T20:23:17.329Z] + setup_miniconda /home/ec2-user/miniconda
2025-05-07T20:23:17.3295989Z ################################################################################
2025-05-07T20:23:17.3311017Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:23:17.4207164Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:23:17.4207687Z + mkdir -p /home/ec2-user/miniconda
2025-05-07T20:23:17.4225805Z [SETUP] Downloading the Miniconda installer ...
2025-05-07T20:23:17.4245936Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
2025-05-07T20:23:18.1628469Z [SETUP] Installing Miniconda ...
2025-05-07T20:23:18.1629031Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u
2025-05-07T20:23:18.1775309Z PREFIX=/home/ec2-user/miniconda
2025-05-07T20:23:18.6266407Z Unpacking payload ...
2025-05-07T20:23:19.1480922Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:23:22.0667366Z Installing base environment...
2025-05-07T20:23:23.1517892Z Preparing transaction: ...working... done
2025-05-07T20:23:26.1580804Z Executing transaction: ...working... done
2025-05-07T20:23:26.9107538Z installation finished.
2025-05-07T20:23:26.9114441Z + rm -f miniconda.sh
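The [EXEC] [ATTEMPT 0/3] prefix on the wget and conda commands comes from a retry wrapper in the prelude that re-runs flaky network operations. A minimal sketch of that pattern; the function name and backoff are assumptions, only the attempt budget is taken from the log:

    # Sketch of the retry wrapper behind the "[EXEC] [ATTEMPT i/3]" lines.
    exec_with_retries () {
      local max_attempts=3
      for attempt in $(seq 0 "${max_attempts}"); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_attempts}] + $*"
        if "$@"; then
          return 0
        fi
        sleep $((2 ** attempt))   # simple exponential backoff between attempts
      done
      echo "[EXEC] Command failed after all retry attempts: $*" >&2
      return 1
    }

    # Usage, mirroring the network check in the log:
    # exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null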
2025-05-07T20:23:26.9426478Z [SETUP] Reloading the bash configuration ...
2025-05-07T20:23:26.9427346Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:23:27.3138656Z no change     /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:23:27.3139154Z no change     /home/ec2-user/miniconda/bin/conda
2025-05-07T20:23:27.3139670Z no change     /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:23:27.3140193Z no change     /home/ec2-user/miniconda/bin/activate
2025-05-07T20:23:27.3140701Z no change     /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:23:27.3141243Z no change     /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:23:27.3141707Z no change     /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:23:27.3142170Z no change     /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:23:27.3142647Z no change     /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:23:27.3143439Z no change     /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:23:27.3143991Z no change     /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:23:27.3144381Z modified      /home/ec2-user/.bashrc
2025-05-07T20:23:27.3144799Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:23:27.3812077Z + . /home/ec2-user/.bashrc
2025-05-07T20:23:28.2224371Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:23:28.2247810Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:23:41.5595836Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:23:43.1369681Z Solving environment: done
2025-05-07T20:23:43.2342441Z ## Package Plan ##
2025-05-07T20:23:43.2342889Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:23:43.2343357Z   added / updated specs:
2025-05-07T20:23:43.2343740Z     - conda-libmamba-solver
2025-05-07T20:23:43.2344134Z     - libarchive
2025-05-07T20:23:43.2344379Z     - libmamba
2025-05-07T20:23:43.2344601Z     - libmambapy
2025-05-07T20:23:43.2344887Z The following packages will be downloaded:
2025-05-07T20:23:43.2345237Z     package                     |            build
2025-05-07T20:23:43.2345571Z     ----------------------------|-----------------
2025-05-07T20:23:43.2346000Z     ca-certificates-2025.4.26   |       hbd8a1cb_0         149 KB  conda-forge
2025-05-07T20:23:43.2346497Z     certifi-2025.4.26           |     pyhd8ed1ab_0         154 KB  conda-forge
2025-05-07T20:23:43.2346951Z     conda-25.3.1                |  py313h78bf25f_1         1.1 MB  conda-forge
2025-05-07T20:23:43.2347448Z     conda-libmamba-solver-25.4.0|     pyhd8ed1ab_0          41 KB  conda-forge
2025-05-07T20:23:43.2347911Z     ------------------------------------------------------------
2025-05-07T20:23:43.2348284Z                                            Total:         1.4 MB
2025-05-07T20:23:43.2348631Z The following packages will be UPDATED:
2025-05-07T20:23:43.2354383Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:23:43.2355200Z   conda              pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:23:43.2355834Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:23:43.2356493Z   certifi            pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:23:43.2357444Z   conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:23:43.2358145Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:23:43.5558022Z Preparing transaction: done
2025-05-07T20:23:43.6563577Z Verifying transaction: done
2025-05-07T20:23:45.0583863Z Executing transaction: done
2025-05-07T20:23:46.7856860Z [SETUP] Updating Miniconda base packages ...
2025-05-07T20:23:46.7882524Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:23:47.7087179Z Channels:
2025-05-07T20:23:47.7087524Z  - defaults
2025-05-07T20:23:47.7087845Z Platform: linux-64
2025-05-07T20:23:48.9354789Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.5519118Z Solving environment: done
2025-05-07T20:23:49.6999753Z ## Package Plan ##
2025-05-07T20:23:49.7000180Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:23:49.7000587Z   added / updated specs:
2025-05-07T20:23:49.7000967Z     - conda
2025-05-07T20:23:49.7001358Z The following packages will be downloaded:
2025-05-07T20:23:49.7001908Z     package                    |            build
2025-05-07T20:23:49.7002438Z     ---------------------------|-----------------
2025-05-07T20:23:49.7002897Z     pip-25.1                   |     pyhc872135_2         1.3 MB
2025-05-07T20:23:49.7003957Z     tzdata-2025b               |       h04d1e81_0         116 KB
2025-05-07T20:23:49.7004454Z     ------------------------------------------------------------
2025-05-07T20:23:49.7004810Z                                            Total:         1.4 MB
2025-05-07T20:23:49.7005189Z The following packages will be UPDATED:
2025-05-07T20:23:49.7005732Z   pip                pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:23:49.7006269Z   tzdata                               2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:23:49.7006689Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:23:50.1370558Z Preparing transaction: done
2025-05-07T20:23:50.2378371Z Verifying transaction: done
2025-05-07T20:23:52.7414781Z Executing transaction: done
2025-05-07T20:23:53.3897432Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:23:53.3901116Z + conda clean --packages --tarball -y
2025-05-07T20:23:54.4375526Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:23:54.4375926Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:23:54.5035298Z + conda clean --all -y
2025-05-07T20:23:55.0749155Z There are no unused tarball(s) to remove.
2025-05-07T20:23:55.0749820Z Will remove 1 index cache(s).
2025-05-07T20:23:55.0750386Z There are no unused package(s) to remove.
2025-05-07T20:23:55.0751004Z There are no tempfile(s) to remove.
2025-05-07T20:23:55.0751585Z There are no logfile(s) to remove.
2025-05-07T20:23:55.1410676Z + conda info
2025-05-07T20:23:55.9310450Z      active environment : base
2025-05-07T20:23:55.9310966Z     active env location : /home/ec2-user/miniconda
2025-05-07T20:23:55.9311464Z             shell level : 1
2025-05-07T20:23:55.9311767Z        user config file : /home/ec2-user/.condarc
2025-05-07T20:23:55.9312181Z  populated config files : /home/ec2-user/miniconda/.condarc
2025-05-07T20:23:55.9312594Z           conda version : 25.3.1
2025-05-07T20:23:55.9312899Z     conda-build version : not installed
2025-05-07T20:23:55.9313220Z          python version : 3.13.2.final.0
2025-05-07T20:23:55.9313683Z                  solver : libmamba (default)
2025-05-07T20:23:55.9314018Z        virtual packages : __archspec=1=zen2
2025-05-07T20:23:55.9314342Z                           __conda=25.3.1=0
2025-05-07T20:23:55.9314637Z                           __cuda=12.8=0
2025-05-07T20:23:55.9314939Z                           __glibc=2.34=0
2025-05-07T20:23:55.9315241Z                           __linux=6.1.130=0
2025-05-07T20:23:55.9315534Z                           __unix=0=0
2025-05-07T20:23:55.9315897Z        base environment : /home/ec2-user/miniconda  (writable)
2025-05-07T20:23:55.9316701Z       conda av data dir : /home/ec2-user/miniconda/etc/conda
2025-05-07T20:23:55.9317077Z   conda av metadata url : None
2025-05-07T20:23:55.9317478Z            channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
2025-05-07T20:23:55.9317945Z                           https://repo.anaconda.com/pkgs/main/noarch
2025-05-07T20:23:55.9318360Z                           https://repo.anaconda.com/pkgs/r/linux-64
2025-05-07T20:23:55.9318762Z                           https://repo.anaconda.com/pkgs/r/noarch
2025-05-07T20:23:55.9319160Z           package cache : /home/ec2-user/miniconda/pkgs
2025-05-07T20:23:55.9319527Z                           /home/ec2-user/.conda/pkgs
2025-05-07T20:23:55.9319891Z        envs directories : /home/ec2-user/miniconda/envs
2025-05-07T20:23:55.9320259Z                           /home/ec2-user/.conda/envs
2025-05-07T20:23:55.9320592Z                platform : linux-64
2025-05-07T20:23:55.9321491Z              user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/.
2025-05-07T20:23:55.9322371Z                 UID:GID : 1000:1000
2025-05-07T20:23:55.9322821Z              netrc file : None
2025-05-07T20:23:55.9323104Z            offline mode : False
2025-05-07T20:23:56.0009061Z [SETUP] Exporting Miniconda variables ...
2025-05-07T20:23:56.0009883Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_34024b15-9fe3-4d03-9b77-057f24e11cb3 ...
2025-05-07T20:23:56.0010739Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda
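The "Saving Miniconda variables" step persists PATH additions through the runner's file commands (the add_path_* file above backs GITHUB_PATH). A minimal sketch of that mechanism, assuming the standard GITHUB_PATH/GITHUB_ENV interface rather than the script's exact helper:

    # Sketch: make conda visible to all later workflow steps.
    # GITHUB_PATH/GITHUB_ENV are the documented Actions file commands;
    # the helper name is an assumption.
    export_miniconda_vars () {
      local prefix="$1"                          # e.g. /home/ec2-user/miniconda
      echo "${prefix}/bin" >> "${GITHUB_PATH}"   # prepended to PATH in later steps
      echo "CONDA=${prefix}" >> "${GITHUB_ENV}"  # plain env var for later steps
    }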
2025-05-07T20:23:56.0084754Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.10
2025-05-07T20:23:56.0085303Z . $PRELUDE; create_conda_environment $BUILD_ENV 3.10
2025-05-07T20:23:56.0104175Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:56.0104562Z env:
2025-05-07T20:23:56.0104807Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:56.0105142Z   BUILD_ENV: build_binary
2025-05-07T20:23:56.0105417Z   BUILD_TARGET: genai
2025-05-07T20:23:56.0105687Z   BUILD_VARIANT: cuda
2025-05-07T20:23:56.0105945Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:56.0106230Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:56.0106565Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:56.0106973Z ##[endgroup]
2025-05-07T20:23:56.3664563Z ################################################################################
2025-05-07T20:23:56.3664948Z # Create Conda Environment
2025-05-07T20:23:56.3665217Z #
2025-05-07T20:23:56.3680173Z # [2025-05-07T20:23:56.367Z] + create_conda_environment build_binary 3.10
2025-05-07T20:23:56.3680612Z ################################################################################
2025-05-07T20:23:56.3695433Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:23:56.4605881Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:23:56.4606436Z [SETUP] Listing existing Conda environments ...
2025-05-07T20:23:56.4606898Z + conda info --envs
2025-05-07T20:23:57.2116632Z # conda environments:
2025-05-07T20:23:57.2116938Z #
2025-05-07T20:23:57.2117188Z base                 /home/ec2-user/miniconda
2025-05-07T20:23:57.2785134Z [SETUP] Deleting the prefix directory if it exists ...
2025-05-07T20:23:58.9228528Z + rm -rf /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:23:58.9249711Z [SETUP] Creating new Conda environment (Python 3.10) ...
2025-05-07T20:23:58.9271806Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.10
2025-05-07T20:23:59.6948279Z Channels:
2025-05-07T20:23:59.6948626Z  - defaults
2025-05-07T20:23:59.6948959Z Platform: linux-64
2025-05-07T20:24:01.3521725Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:01.4527289Z Solving environment: done
2025-05-07T20:24:01.4836298Z ## Package Plan ##
2025-05-07T20:24:01.4836853Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:01.4837473Z   added / updated specs:
2025-05-07T20:24:01.4837845Z     - python=3.10
2025-05-07T20:24:01.4838232Z The following packages will be downloaded:
2025-05-07T20:24:01.4838637Z     package                    |            build
2025-05-07T20:24:01.4839007Z     ---------------------------|-----------------
2025-05-07T20:24:01.4839459Z     _libgcc_mutex-0.1          |             main           3 KB
2025-05-07T20:24:01.4839936Z     _openmp_mutex-5.1          |            1_gnu          21 KB
2025-05-07T20:24:01.4840402Z     ca-certificates-2025.2.25  |       h06a4308_0         129 KB
2025-05-07T20:24:01.4840868Z     python-3.10.16             |       he870216_1        26.9 MB
2025-05-07T20:24:01.4841663Z     setuptools-78.1.1          |  py310h06a4308_0         1.7 MB
2025-05-07T20:24:01.4842113Z     wheel-0.45.1               |  py310h06a4308_0         115 KB
2025-05-07T20:24:01.4842520Z     ------------------------------------------------------------
2025-05-07T20:24:01.4842903Z                                            Total:        28.8 MB
2025-05-07T20:24:01.4843293Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:01.4843989Z   _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:01.4844505Z   _openmp_mutex      pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:01.4845162Z   bzip2              pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6
2025-05-07T20:24:01.4846309Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:01.4847068Z   ld_impl_linux-64   pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:01.4847592Z   libffi             pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:01.4848230Z   libgcc-ng          pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:01.4848733Z   libgomp            pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:01.4849275Z   libstdcxx-ng       pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:01.4849792Z   libuuid            pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
2025-05-07T20:24:01.4850275Z   ncurses            pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:01.4850752Z   openssl            pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:01.4851204Z   pip                pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:01.4851658Z   python             pkgs/main/linux-64::python-3.10.16-he870216_1
2025-05-07T20:24:01.4852144Z   readline           pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:01.4852666Z   setuptools         pkgs/main/linux-64::setuptools-78.1.1-py310h06a4308_0
2025-05-07T20:24:01.4853195Z   sqlite             pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:01.4853632Z   tk                 pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:01.4854067Z   tzdata             pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:01.4854531Z   wheel              pkgs/main/linux-64::wheel-0.45.1-py310h06a4308_0
2025-05-07T20:24:01.4854977Z   xz                 pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:01.4855403Z   zlib               pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:01.4855850Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:02.8620412Z Preparing transaction: done
2025-05-07T20:24:04.0284379Z Verifying transaction: done
2025-05-07T20:24:06.3463290Z Executing transaction: done
2025-05-07T20:24:06.3973787Z #
2025-05-07T20:24:06.3974493Z # To activate this environment, use
2025-05-07T20:24:06.3975831Z #
2025-05-07T20:24:06.3976376Z #     $ conda activate build_binary
2025-05-07T20:24:06.3976939Z #
2025-05-07T20:24:06.3977408Z # To deactivate an active environment, use
2025-05-07T20:24:06.3978033Z #
2025-05-07T20:24:06.3978439Z #     $ conda deactivate
2025-05-07T20:24:06.5055843Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:06.5077188Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:09.6131802Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (25.1)
2025-05-07T20:24:09.6132517Z Collecting pip
2025-05-07T20:24:09.6132881Z   Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:09.6133361Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:09.6134307Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 92.8 MB/s eta 0:00:00
2025-05-07T20:24:09.6134714Z Installing collected packages: pip
2025-05-07T20:24:09.6135080Z   Attempting uninstall: pip
2025-05-07T20:24:09.6135408Z     Found existing installation: pip 25.1
2025-05-07T20:24:09.6135758Z     Uninstalling pip-25.1:
2025-05-07T20:24:09.6136080Z       Successfully uninstalled pip-25.1
2025-05-07T20:24:09.6136438Z Successfully installed pip-25.1.1
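create_conda_environment tears down any stale prefix, creates a fresh env pinned to the requested Python, and then drives tools inside it with conda run so no interactive activation is needed. A minimal sketch of that flow under those assumptions (the real logic lives in .github/scripts/setup_env.bash):

    # Sketch of the create-environment flow seen above; the function shape
    # is an assumption, the individual commands mirror the log.
    create_conda_environment () {
      local env_name="$1" python_version="$2"
      # Start from a clean slate so retries never reuse a half-built env
      rm -rf "$(conda info --base)/envs/${env_name}"
      conda create -y -n "${env_name}" "python=${python_version}"
      # conda run executes inside the env without needing `conda activate`
      conda run -n "${env_name}" pip install --upgrade pip
    }

    # Usage matching the log: create_conda_environment build_binary 3.10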
2025-05-07T20:24:09.6800523Z [SETUP] Upgrading pyOpenSSL ...
2025-05-07T20:24:09.6826737Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:10.5948059Z Channels:
2025-05-07T20:24:10.5948515Z  - conda-forge
2025-05-07T20:24:10.5948972Z Platform: linux-64
2025-05-07T20:24:21.4389871Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:23.0607194Z Solving environment: done
2025-05-07T20:24:23.1239804Z ## Package Plan ##
2025-05-07T20:24:23.1240391Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:23.1240984Z   added / updated specs:
2025-05-07T20:24:23.1241279Z     - pyopenssl[version='>22.1.0']
2025-05-07T20:24:23.1241628Z The following packages will be downloaded:
2025-05-07T20:24:23.1242001Z     package                    |            build
2025-05-07T20:24:23.1242369Z     ---------------------------|-----------------
2025-05-07T20:24:23.1242782Z     cffi-1.17.1                |  py310h8deb56e_0         238 KB  conda-forge
2025-05-07T20:24:23.1243267Z     cryptography-44.0.3        |  py310h6c63255_0         1.5 MB  conda-forge
2025-05-07T20:24:23.1243803Z     libgcc-15.1.0              |       h767d61c_2         810 KB  conda-forge
2025-05-07T20:24:23.1244315Z     libgcc-ng-15.1.0           |       h69a702a_2          34 KB  conda-forge
2025-05-07T20:24:23.1244766Z     libgomp-15.1.0             |       h767d61c_2         442 KB  conda-forge
2025-05-07T20:24:23.1245227Z     openssl-3.5.0              |       h7b32b05_1         3.0 MB  conda-forge
2025-05-07T20:24:23.1245697Z     pycparser-2.22             |     pyh29332c3_1         108 KB  conda-forge
2025-05-07T20:24:23.1246182Z     pyopenssl-25.0.0           |     pyhd8ed1ab_0         120 KB  conda-forge
2025-05-07T20:24:23.1246645Z     python_abi-3.10            |          2_cp310           4 KB  conda-forge
2025-05-07T20:24:23.1247151Z     typing-extensions-4.13.2   |       h0e9735f_0          88 KB  conda-forge
2025-05-07T20:24:23.1247682Z     typing_extensions-4.13.2   |     pyh29332c3_0          51 KB  conda-forge
2025-05-07T20:24:23.1248139Z     ------------------------------------------------------------
2025-05-07T20:24:23.1248515Z                                            Total:         6.3 MB
2025-05-07T20:24:23.1248930Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:23.1249812Z   cffi               conda-forge/linux-64::cffi-1.17.1-py310h8deb56e_0
2025-05-07T20:24:23.1250358Z   cryptography       conda-forge/linux-64::cryptography-44.0.3-py310h6c63255_0
2025-05-07T20:24:23.1250897Z   libgcc             conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:24:23.1251383Z   pycparser          conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:24:23.1251891Z   pyopenssl          conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:24:23.1252548Z   python_abi         conda-forge/linux-64::python_abi-3.10-2_cp310
2025-05-07T20:24:23.1253409Z   typing-extensions  conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:24:23.1254272Z   typing_extensions  conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:24:23.1254998Z The following packages will be UPDATED:
2025-05-07T20:24:23.1255971Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:23.1257257Z   libgcc-ng          pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:24:23.1258343Z   libgomp            pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:24:23.1259257Z   openssl            pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:23.1259965Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:23.7644781Z Preparing transaction: done
2025-05-07T20:24:23.8650197Z Verifying transaction: done
2025-05-07T20:24:25.3676448Z Executing transaction: done
2025-05-07T20:24:25.5473833Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:27.3006181Z [CHECK] Python (sub-)package 'OpenSSL' found ...
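The "[SETUP] Testing pyOpenSSL import" check verifies the package is actually importable inside the environment before the build proceeds. A minimal sketch of such a check; the helper name is an assumption, while the conda run pattern matches commands visible elsewhere in the log:

    # Sketch: verify a Python (sub-)package is importable inside the env.
    test_python_import () {
      local env_name="$1" package="$2"
      if conda run -n "${env_name}" python -c "import ${package}"; then
        echo "[CHECK] Python (sub-)package '${package}' found ..."
      else
        echo "[CHECK] Python (sub-)package '${package}' is missing" >&2
        return 1
      fi
    }

    # Usage matching the log: test_python_import build_binary OpenSSL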
2025-05-07T20:24:27.3019037Z [SETUP] Installing libxcrypt ...
2025-05-07T20:24:27.3042614Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:24:28.1645949Z Channels:
2025-05-07T20:24:28.1646255Z  - conda-forge
2025-05-07T20:24:28.1646505Z Platform: linux-64
2025-05-07T20:24:31.4977236Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:31.8769293Z Solving environment: done
2025-05-07T20:24:31.9402405Z ## Package Plan ##
2025-05-07T20:24:31.9402849Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:31.9403307Z   added / updated specs:
2025-05-07T20:24:31.9403587Z     - libxcrypt
2025-05-07T20:24:31.9403871Z The following packages will be downloaded:
2025-05-07T20:24:31.9404242Z     package                    |            build
2025-05-07T20:24:31.9404599Z     ---------------------------|-----------------
2025-05-07T20:24:31.9405030Z     libxcrypt-4.4.36           |       hd590300_1          98 KB  conda-forge
2025-05-07T20:24:31.9405468Z     ------------------------------------------------------------
2025-05-07T20:24:31.9405845Z                                            Total:          98 KB
2025-05-07T20:24:31.9406220Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:31.9406713Z   libxcrypt          conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:24:31.9407189Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:32.2269890Z Preparing transaction: done
2025-05-07T20:24:32.3273342Z Verifying transaction: done
2025-05-07T20:24:32.4280026Z Executing transaction: done
2025-05-07T20:24:35.9563134Z [SETUP] Copying over ...
2025-05-07T20:24:35.9563893Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.10/crypt.h
2025-05-07T20:24:37.5974944Z [SETUP] Installed Python version: Python 3.10.16
2025-05-07T20:24:37.5975426Z [SETUP] Successfully created Conda environment: build_binary
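The crypt.h copy above exists because Python 3.10's headers still include <crypt.h>, which newer glibc-based distros no longer ship; libxcrypt provides the header, and copying it next to the Python headers lets later native builds compile. A sketch of the workaround, assuming the hard-coded prefix shown in the log:

    # Sketch: expose libxcrypt's crypt.h to code that includes Python headers.
    # The prefix is taken from the log; resolving it dynamically is left out.
    prefix=/home/ec2-user/miniconda/envs/build_binary
    cp "${prefix}/include/crypt.h" "${prefix}/include/python3.10/crypt.h"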
2025-05-07T20:24:37.6010446Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:37.6010951Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:37.6026195Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:37.6026751Z env:
2025-05-07T20:24:37.6027006Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:37.6027338Z   BUILD_ENV: build_binary
2025-05-07T20:24:37.6027614Z   BUILD_TARGET: genai
2025-05-07T20:24:37.6027873Z   BUILD_VARIANT: cuda
2025-05-07T20:24:37.6028131Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:24:37.6028419Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:37.6028754Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:37.6029121Z ##[endgroup]
2025-05-07T20:24:37.9509511Z ################################################################################
2025-05-07T20:24:37.9510086Z # Install C/C++ Compilers
2025-05-07T20:24:37.9510495Z #
2025-05-07T20:24:37.9525737Z # [2025-05-07T20:24:37.952Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:24:37.9526406Z ################################################################################
2025-05-07T20:24:37.9541350Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:38.0461497Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:24:38.0470617Z [INSTALL] Installing GLIBC (architecture = 64) ...
2025-05-07T20:24:38.0494156Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:24:38.9642019Z Channels:
2025-05-07T20:24:38.9642285Z  - conda-forge
2025-05-07T20:24:38.9642542Z Platform: linux-64
2025-05-07T20:24:42.3619254Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:42.7289500Z Solving environment: done
2025-05-07T20:24:42.7910852Z ## Package Plan ##
2025-05-07T20:24:42.7911382Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:42.7911806Z   added / updated specs:
2025-05-07T20:24:42.7912084Z     - sysroot_linux-64=2.17
2025-05-07T20:24:42.7912411Z The following packages will be downloaded:
2025-05-07T20:24:42.7912765Z     package                       |            build
2025-05-07T20:24:42.7913097Z     ------------------------------|-----------------
2025-05-07T20:24:42.7913624Z     kernel-headers_linux-64-3.10.0|      he073ed8_18         921 KB  conda-forge
2025-05-07T20:24:42.7914204Z     sysroot_linux-64-2.17         |      h0157908_18        14.5 MB  conda-forge
2025-05-07T20:24:42.7914781Z     ------------------------------------------------------------
2025-05-07T20:24:42.7915254Z                                            Total:        15.4 MB
2025-05-07T20:24:42.7915709Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:42.7916264Z   kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:24:42.7917054Z   sysroot_linux-64   conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:24:42.7917557Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:43.9204885Z Preparing transaction: done
2025-05-07T20:24:44.1210263Z Verifying transaction: done
2025-05-07T20:24:44.3287928Z Executing transaction: done
2025-05-07T20:24:44.4809271Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:24:44.4809758Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:24:46.2391983Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
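Pinning sysroot_linux-64=2.17 makes the conda-forge toolchain compile and link against glibc 2.17 symbols, which keeps the resulting binaries loadable on older distributions. One way a later step could spot-check that property; this check is an assumption, not a command from the log, and the .so path is a placeholder:

    # Sketch: report the highest GLIBC symbol version an artifact references.
    # Anything above 2.17 would defeat the portability goal of the sysroot pin.
    objdump -T path/to/built_library.so | grep -o 'GLIBC_[0-9.]*' | sort -Vu | tail -1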
2025-05-07T20:24:46.2405600Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
2025-05-07T20:24:46.2429969Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:24:47.1980447Z Channels:
2025-05-07T20:24:47.1980808Z  - conda-forge
2025-05-07T20:24:47.1981152Z Platform: linux-64
2025-05-07T20:24:50.6202635Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:51.5819545Z Solving environment: done
2025-05-07T20:24:51.6463378Z ## Package Plan ##
2025-05-07T20:24:51.6463766Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:51.6464345Z   added / updated specs:
2025-05-07T20:24:51.6464656Z     - gxx_linux-64=11.4.0
2025-05-07T20:24:51.6464971Z The following packages will be downloaded:
2025-05-07T20:24:51.6465348Z     package                        |            build
2025-05-07T20:24:51.6465694Z     -------------------------------|-----------------
2025-05-07T20:24:51.6466119Z     binutils_impl_linux-64-2.40    |       ha1999f0_7         6.0 MB  conda-forge
2025-05-07T20:24:51.6466657Z     binutils_linux-64-2.40         |       hb3c18ed_4          28 KB  conda-forge
2025-05-07T20:24:51.6467141Z     gcc_impl_linux-64-11.4.0       |      h00c12a0_13        53.0 MB  conda-forge
2025-05-07T20:24:51.6467609Z     gcc_linux-64-11.4.0            |       ha077dfb_4          31 KB  conda-forge
2025-05-07T20:24:51.6468076Z     gxx_impl_linux-64-11.4.0       |      h634f3ee_13        11.2 MB  conda-forge
2025-05-07T20:24:51.6468538Z     gxx_linux-64-11.4.0            |       h35bfe5d_4          29 KB  conda-forge
2025-05-07T20:24:51.6468986Z     ld_impl_linux-64-2.40          |       hf3520f5_7         691 KB  conda-forge
2025-05-07T20:24:51.6469484Z     libgcc-devel_linux-64-11.4.0   |     h8f596e0_113         2.3 MB  conda-forge
2025-05-07T20:24:51.6469983Z     libsanitizer-11.4.0            |      h5763a12_13         3.5 MB  conda-forge
2025-05-07T20:24:51.6470439Z     libstdcxx-15.1.0               |       h8f9b012_2         3.7 MB  conda-forge
2025-05-07T20:24:51.6470937Z     libstdcxx-devel_linux-64-11.4.0|     h8f596e0_113        11.1 MB  conda-forge
2025-05-07T20:24:51.6471440Z     libstdcxx-ng-15.1.0            |       h4852527_2          34 KB  conda-forge
2025-05-07T20:24:51.6471863Z     ------------------------------------------------------------
2025-05-07T20:24:51.6472217Z                                            Total:        91.6 MB
2025-05-07T20:24:51.6472581Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:51.6473092Z   binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7
2025-05-07T20:24:51.6473796Z   binutils_linux-64  conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4
2025-05-07T20:24:51.6474719Z   gcc_impl_linux-64  conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13
2025-05-07T20:24:51.6475264Z   gcc_linux-64       conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4
2025-05-07T20:24:51.6475795Z   gxx_impl_linux-64  conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13
2025-05-07T20:24:51.6476475Z   gxx_linux-64       conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4
2025-05-07T20:24:51.6477036Z   libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:24:51.6477623Z   libsanitizer       conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13
2025-05-07T20:24:51.6478144Z   libstdcxx          conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2
2025-05-07T20:24:51.6478708Z   libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:24:51.6479210Z The following packages will be UPDATED:
2025-05-07T20:24:51.6479762Z   ld_impl_linux-64   pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
2025-05-07T20:24:51.6480511Z   libstdcxx-ng       pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2
2025-05-07T20:24:51.6481103Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:54.2852396Z Preparing transaction: done
2025-05-07T20:24:54.5865185Z Verifying transaction: done
2025-05-07T20:24:54.7875770Z Executing transaction: done
2025-05-07T20:24:54.9583037Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:24:59.0368406Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:59.0400569Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:59.0430336Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:24:59.0460282Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:00.9522511Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:25:01.0164479Z [CHECK] Binary cc found in PATH
2025-05-07T20:25:03.0315009Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:03.0983746Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:05.1130971Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:05.1770631Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:07.0851163Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:07.1470430Z [CHECK] Binary g++ found in PATH
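Note: a quick way to double-check the symlinks set above is to resolve each wrapper and ask it for its version. This is a minimal sketch, not part of the job's setup scripts; it assumes the build_binary env and paths shown in the log:

    # Each entry point should resolve into the conda env's bin/ directory
    # and report gcc/g++ 11.4.0 from conda-forge.
    for tool in cc gcc c++ g++; do
      readlink -f "/home/ec2-user/miniconda/envs/build_binary/bin/${tool}"
      "/home/ec2-user/miniconda/envs/build_binary/bin/${tool}" --version | head -n 1
    done

2025-05-07T20:25:07.1475574Z [INFO] Printing out all preprocessor defines in the C compiler ...
Note: the dump that follows is produced by running the preprocessor on empty input; with -dM, the -E run prints every predefined macro rather than preprocessed source. A standalone reproduction, assuming the same env name, could be:

    # Dump all predefined macros of the C compiler in the build env, then
    # pull out the version macros as a toolchain sanity check (GCC 11.4 here).
    conda run -n build_binary cc -dM -E - < /dev/null | sort
    conda run -n build_binary cc -dM -E - < /dev/null | grep -E '__GNUC__|__GNUC_MINOR__|__VERSION__'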
2025-05-07T20:25:07.1476901Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:25:07.1477795Z 2025-05-07T20:25:09.0805359Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:09.0806686Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:09.0807221Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:09.0807655Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:09.0808667Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:09.0809620Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:09.0810151Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:09.0810730Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:09.0811118Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:09.0811810Z #define __CHAR_BIT__ 8 2025-05-07T20:25:09.0813527Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:09.0814126Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:09.0814665Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:09.0815190Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:09.0815746Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:09.0816174Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:09.0816596Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:09.0817138Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:09.0817570Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:09.0818016Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:09.0818675Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:09.0819274Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:09.0819727Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:09.0820244Z #define __GCC_IEC_559 2 2025-05-07T20:25:09.0820630Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:09.0820983Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:09.0821484Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:09.0821911Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:09.0822322Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:09.0822897Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:09.0823299Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:09.0824048Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:09.0824471Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:09.0824885Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:09.0825383Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:09.0825761Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:09.0826149Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:09.0826596Z #define __INT8_C(c) c 2025-05-07T20:25:09.0826938Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:09.0827368Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:09.0827897Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:09.0828392Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:09.0828866Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:09.0829332Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:09.0829744Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:09.0830143Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:09.0830592Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:09.0831152Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:09.0831701Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:09.0832158Z #define __linux 1 2025-05-07T20:25:09.0832529Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:09.0832969Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:09.0833372Z #define __unix 1 2025-05-07T20:25:09.0833917Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:09.0834352Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:09.0834756Z #define __WINT_MIN__ 0U 2025-05-07T20:25:09.0835170Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:09.0835582Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:09.0835984Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:09.0836423Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:09.0836804Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:09.0837273Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:09.0837688Z #define __INT64_C(c) c ## L 2025-05-07T20:25:09.0838360Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:09.0838910Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:09.0839270Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:09.0839758Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:09.0840507Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:09.0840851Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:09.0841245Z #define __DBL_DIG__ 15 2025-05-07T20:25:09.0841681Z #define __FLT32_DIG__ 6 2025-05-07T20:25:09.0842120Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:09.0842593Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:09.0843034Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:09.0843498Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:09.0843964Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:09.0844402Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:09.0844799Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:09.0845307Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:09.0845908Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:09.0846312Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:09.0846706Z #define __unix__ 1 2025-05-07T20:25:09.0847107Z #define __INT_WIDTH__ 32 2025-05-07T20:25:09.0847505Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:09.0847859Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:09.0848288Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:09.0848800Z #define __UINT16_C(c) c 2025-05-07T20:25:09.0849170Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:09.0849591Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:09.0850104Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:09.0850585Z #define __gnu_linux__ 1 2025-05-07T20:25:09.0851059Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:09.0851450Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:09.0851847Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:09.0852352Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:09.0852710Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:09.0853066Z #define __GNUC__ 11 2025-05-07T20:25:09.0853501Z #define __pie__ 2 2025-05-07T20:25:09.0853873Z #define __MMX__ 1 2025-05-07T20:25:09.0854218Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:09.0854512Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:09.0854826Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:09.0855130Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:09.0855509Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:09.0855949Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:09.0856298Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:09.0856583Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:09.0867442Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:09.0867823Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:09.0868121Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:09.0868427Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:09.0868756Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:09.0869123Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:09.0869454Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:09.0869796Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:09.0870081Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:09.0870380Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:09.0870688Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:09.0870975Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:09.0871263Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:09.0871618Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:09.0872018Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:09.0872315Z #define __SSE2_MATH__ 1 2025-05-07T20:25:09.0872587Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:09.0872927Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:09.0873247Z #define __amd64 1 2025-05-07T20:25:09.0873632Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:09.0874217Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:09.0874557Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:09.0874904Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:09.0875194Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:09.0875637Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:09.0875925Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:09.0876224Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:09.0876511Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:09.0876809Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:09.0877107Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:09.0877418Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:09.0877688Z #define __x86_64 1 2025-05-07T20:25:09.0877949Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:09.0878364Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:09.0878868Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:09.0879381Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:09.0879897Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:09.0880317Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:09.0880597Z #define __LP64__ 1 2025-05-07T20:25:09.0880865Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:09.0881255Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:09.0881669Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:09.0881977Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:09.0882286Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:09.0882595Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:09.0882901Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:09.0883202Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:09.0883485Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:09.0883776Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:09.0884070Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:09.0884437Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:09.0884840Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:09.0885152Z #define __FLT_DIG__ 6 2025-05-07T20:25:09.0885404Z #define __NO_INLINE__ 1 2025-05-07T20:25:09.0885678Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:09.0886051Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:09.0886442Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:09.0886723Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:09.0887019Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:09.0887309Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:09.0887591Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:09.0887879Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:09.0888210Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:09.0888523Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:09.0888821Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:09.0889159Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:09.0889524Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:09.0889823Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:09.0890111Z #define __FLT128_DIG__ 33 2025-05-07T20:25:09.0890371Z #define __INT32_C(c) c 2025-05-07T20:25:09.0890639Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:09.0890954Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:09.0891262Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:09.0891570Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:09.0891927Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:09.0892267Z #define unix 1 2025-05-07T20:25:09.0892519Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:09.0892868Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:09.0893206Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:09.0893547Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:09.0893911Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:09.0894193Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:09.0894476Z #define __ELF__ 1 2025-05-07T20:25:09.0894877Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:09.0895198Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:09.0895496Z #define __FLT_RADIX__ 2 2025-05-07T20:25:09.0895857Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:09.0896341Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:09.0896742Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:09.0897025Z #define __SSE_MATH__ 1 2025-05-07T20:25:09.0897274Z #define __k8 1 2025-05-07T20:25:09.0897603Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:09.0898018Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:09.0898348Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:09.0898674Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:09.0898963Z #define __LDBL_DIG__ 18 2025-05-07T20:25:09.0899236Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:09.0899516Z #define __x86_64__ 1 2025-05-07T20:25:09.0899786Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:09.0900128Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:09.0900495Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:09.0900834Z #define __FLT64_DIG__ 15 2025-05-07T20:25:09.0901153Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:09.0901541Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:09.0901898Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:09.0902197Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:09.0902500Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:09.0902837Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:09.0903243Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:09.0903686Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:09.0904008Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:09.0904382Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:09.0904742Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:09.0905072Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:09.0905383Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:09.0905724Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:09.0906030Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:09.0906302Z #define __SEG_FS 1 2025-05-07T20:25:09.0906558Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:09.0906859Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:09.0907166Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:09.0907484Z #define __SEG_GS 1 2025-05-07T20:25:09.0907830Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:09.0908244Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:09.0908549Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:09.0908867Z #define __INT16_TYPE__ short int 2025-05-07T20:25:09.0909172Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:09.0909503Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:09.0909798Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:09.0910078Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:09.0910371Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:09.0910751Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:09.0911170Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:09.0911495Z #define linux 1 2025-05-07T20:25:09.0911748Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:09.0912054Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:09.0912351Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:09.0912632Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:09.0912924Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:09.0913214Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:09.0913763Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:09.0914228Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:09.0914590Z #define __code_model_small__ 1 2025-05-07T20:25:09.0914901Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:09.0915223Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:09.0915634Z #define __k8__ 1 2025-05-07T20:25:09.0915897Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:09.0916229Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:09.0916559Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:09.0916922Z #define __pic__ 2 2025-05-07T20:25:09.0917211Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:09.0917561Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:09.0917881Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:09.0918263Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:09.0918676Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:09.0919071Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:09.0919379Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:09.0919709Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:09.0920049Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:09.0920329Z #define __linux__ 1 2025-05-07T20:25:09.0920593Z #define __INT64_TYPE__ long int 2025-05-07T20:25:09.0920881Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:09.0921175Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:09.0921478Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:09.0921757Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:09.0922091Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:09.0922468Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:09.0922798Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:09.0923088Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:09.0923415Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:09.0923746Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:09.0924548Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:09.0925031Z #define __SSE__ 1 2025-05-07T20:25:09.0925286Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:09.0925657Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:09.0926038Z #define __amd64__ 1 2025-05-07T20:25:09.0926299Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:09.0926576Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:09.0926878Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:09.0927182Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:09.0927483Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:09.0927790Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:09.0928081Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:09.0928379Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:09.0928675Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:09.0929067Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:09.0929583Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:09.0929970Z #define _LP64 1 2025-05-07T20:25:09.0930210Z #define __UINT8_C(c) c 2025-05-07T20:25:09.0930480Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:09.0930771Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:09.0931071Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:09.0931383Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:09.0931715Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:09.0932113Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:09.0932635Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:09.0933050Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:09.0933373Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:09.0933725Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:09.0934133Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:09.0934534Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:09.0934832Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:09.0935213Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:09.0935615Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:09.0935908Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:09.0936193Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:09.0936739Z #define __FXSR__ 1 2025-05-07T20:25:09.0937083Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:09.0937591Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:09.0938170Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:09.0938510Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:09.0938796Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:09.0939166Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:09.0939556Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:09.0939830Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:09.0940095Z #define __PIC__ 2 2025-05-07T20:25:09.0940368Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:09.0940809Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:09.0941241Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:09.0941615Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:09.0941982Z #define __SSE2__ 1 2025-05-07T20:25:09.0942231Z #define __INT32_TYPE__ int 2025-05-07T20:25:09.0942512Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:09.0942796Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:09.0943179Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:09.0943577Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:09.0943875Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:09.0944176Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:09.0944481Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:09.0944783Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:09.0945061Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:09.0945338Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:09.0945654Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:09.0945987Z #define __PIE__ 2 2025-05-07T20:25:09.0946344Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:09.0946775Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:09.0947164Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:09.0947571Z #define __INT16_C(c) c 2025-05-07T20:25:09.0947823Z #define __STDC__ 1 2025-05-07T20:25:09.0948084Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:09.0948393Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:09.0948681Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:09.0949012Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:09.0949406Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:09.0949778Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:09.0950071Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:09.0950389Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:09.0950688Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:09.0951000Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:09.0951327Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:09.0951635Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:09.0951975Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:09.0952411Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:09.0952827Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:09.0953174Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:09.0953650Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:09.0953934Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:09.0954113Z 2025-05-07T20:25:09.1480067Z 2025-05-07T20:25:09.1480968Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
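Note: the same technique applies to the C++ front end; -x c++ tells the driver to treat stdin as C++ source. One detail worth calling out, assuming the same env: GCC 11 defaults to the gnu++17 dialect, which is why the dump below reports __cplusplus 201703L.

    # Print the default C++ standard macro; expect 201703L (C++17) for GCC 11.
    conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep '#define __cplusplus'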
2025-05-07T20:25:09.1481664Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:09.1482015Z 2025-05-07T20:25:11.1701312Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:11.1701829Z #define __cpp_attributes 200809L 2025-05-07T20:25:11.1702388Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:11.1702931Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:11.1703367Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:11.1704025Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:11.1704399Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:11.1704776Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:11.1705091Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:11.1705624Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:11.1706124Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:11.1706553Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:11.1706940Z #define __CHAR_BIT__ 8 2025-05-07T20:25:11.1707325Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:11.1707770Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:11.1708243Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:11.1708560Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:11.1708868Z #define __cpp_static_assert 201411L 2025-05-07T20:25:11.1709197Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:11.1709538Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:11.1709868Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:11.1710204Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:11.1710574Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:11.1710938Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:11.1711384Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:11.1711849Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:11.1712199Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:11.1712507Z #define __GCC_IEC_559 2 2025-05-07T20:25:11.1712793Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:11.1713103Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:11.1713404Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:11.1713910Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:11.1714241Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:11.1714591Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:11.1714940Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:11.1715313Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:11.1715676Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:11.1715978Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:11.1716284Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:11.1716595Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:11.1716927Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:11.1717223Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:11.1717519Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:11.1717828Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:11.1718197Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:11.1718563Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:11.1718843Z #define __INT8_C(c) c 2025-05-07T20:25:11.1719150Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:11.1719465Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:11.1719817Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:11.1720181Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:11.1720493Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:11.1720818Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:11.1721171Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:11.1721566Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:11.1721890Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:11.1722194Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:11.1722489Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:11.1722799Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:11.1723105Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:11.1723537Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:11.1724403Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:11.1724749Z #define __linux 1 2025-05-07T20:25:11.1725017Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:11.1725354Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:11.1725685Z #define __unix 1 2025-05-07T20:25:11.1725952Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:11.1726543Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:11.1726873Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:11.1727168Z #define __WINT_MIN__ 0U 2025-05-07T20:25:11.1727446Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:11.1727880Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:11.1728178Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:11.1728478Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:11.1728763Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:11.1729069Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:11.1729399Z #define __INT64_C(c) c ## L 2025-05-07T20:25:11.1729700Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:11.1730024Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:11.1730335Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:11.1730669Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:11.1730970Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:11.1731268Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:11.1731661Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:11.1732073Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:11.1732344Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:11.1732651Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:11.1732966Z #define __DBL_DIG__ 15 2025-05-07T20:25:11.1733214Z #define __FLT32_DIG__ 6 2025-05-07T20:25:11.1733549Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:11.1733923Z #define __GXX_WEAK__ 1 2025-05-07T20:25:11.1734205Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:11.1734482Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:11.1734843Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:11.1735220Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:11.1735515Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:11.1735852Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:11.1736209Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:11.1736659Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:11.1737095Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:11.1737393Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:11.1737688Z #define __unix__ 1 2025-05-07T20:25:11.1737937Z #define __INT_WIDTH__ 32 2025-05-07T20:25:11.1738201Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:11.1738476Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:11.1738765Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:11.1748836Z #define __UINT16_C(c) c 2025-05-07T20:25:11.1749170Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:11.1749472Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:11.1749911Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:11.1750365Z #define __gnu_linux__ 1 2025-05-07T20:25:11.1750642Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:11.1750936Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:11.1751254Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:11.1751580Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:11.1751885Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:11.1752179Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:11.1752452Z #define __GNUC__ 11 2025-05-07T20:25:11.1752698Z #define __GXX_RTTI 1 2025-05-07T20:25:11.1752947Z #define __pie__ 2 2025-05-07T20:25:11.1753177Z #define __MMX__ 1 2025-05-07T20:25:11.1753425Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:11.1753851Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:11.1754157Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:11.1754457Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:11.1754738Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:11.1755069Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:11.1755423Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:11.1755813Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:11.1756230Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:11.1756562Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:11.1757134Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:11.1757434Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:11.1757724Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:11.1758066Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:11.1758495Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:11.1758782Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:11.1759071Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:11.1759388Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:11.1759707Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:11.1760009Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:11.1760317Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:11.1760591Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:11.1760887Z #define __cplusplus 201703L 2025-05-07T20:25:11.1761184Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:11.1761498Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:11.1761775Z #define __DEPRECATED 1 2025-05-07T20:25:11.1762066Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:11.1762393Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:11.1762671Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:11.1763021Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:11.1763419Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:11.1763711Z #define __SSE2_MATH__ 1 2025-05-07T20:25:11.1763987Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:11.1764323Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:11.1764639Z #define __amd64 1 2025-05-07T20:25:11.1764889Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:11.1765188Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:11.1765476Z #define __GNUG__ 11 2025-05-07T20:25:11.1765759Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:11.1766105Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:11.1766388Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:11.1766670Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:11.1766975Z [compiler predefined-macro dump elided: several hundred #define lines confirming GCC 11.4.0 (__VERSION__ "11.4.0", __GNUC_MINOR__ 4, __GXX_ABI_VERSION 1016) targeting x86_64 / LP64 little-endian Linux (__x86_64__ 1, __LP64__ 1, __linux__ 1, __ELF__ 1, _GNU_SOURCE 1) with C++17 defaults (__cpp_deduction_guides 201703L, __cpp_constexpr 201603L)]
2025-05-07T20:25:11.2382762Z + conda run -n build_binary c++ --version
2025-05-07T20:25:13.2380608Z c++
(conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:25:13.2381022Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:25:13.2381489Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:25:13.2382049Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:25:13.2382394Z 2025-05-07T20:25:13.2382399Z 2025-05-07T20:25:13.3025153Z 2025-05-07T20:25:13.3026168Z [INFO] Printing the default version of the C standard used by the compiler ... 2025-05-07T20:25:13.3026985Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__ 2025-05-07T20:25:13.3027446Z 2025-05-07T20:25:15.2869407Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:15.2871818Z 2025-05-07T20:25:15.2872623Z [INFO] Printing the default version of the C++ standard used by the compiler ... 2025-05-07T20:25:15.2873225Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus 2025-05-07T20:25:15.2873829Z 2025-05-07T20:25:17.2913622Z #define __cplusplus 201703L 2025-05-07T20:25:17.2916057Z 2025-05-07T20:25:17.2916786Z [INSTALL] Successfully installed C/C++ compilers 2025-05-07T20:25:17.2969744Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.8.0 2025-05-07T20:25:17.2970211Z . $PRELUDE; install_cuda $BUILD_ENV 12.8.0 2025-05-07T20:25:17.2982923Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:25:17.2983321Z env: 2025-05-07T20:25:17.2983575Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:25:17.2983919Z BUILD_ENV: build_binary 2025-05-07T20:25:17.2984202Z BUILD_TARGET: genai 2025-05-07T20:25:17.2984466Z BUILD_VARIANT: cuda 2025-05-07T20:25:17.2984732Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:25:17.2985023Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:25:17.2985369Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:25:17.2985740Z ##[endgroup] 2025-05-07T20:25:17.6684008Z ################################################################################ 2025-05-07T20:25:17.6684446Z # Install CUDA 2025-05-07T20:25:17.6684680Z # 2025-05-07T20:25:17.6702401Z # [2025-05-07T20:25:17.669Z] + install_cuda build_binary 12.8.0 2025-05-07T20:25:17.6702835Z ################################################################################ 2025-05-07T20:25:17.6703075Z 2025-05-07T20:25:17.6720703Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:25:17.7771574Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:25:17.7771961Z [SETUP] Cleaning up Conda packages ... 2025-05-07T20:25:17.7777607Z + conda clean --packages --tarball -y 2025-05-07T20:25:17.7777837Z 2025-05-07T20:25:18.5240471Z Will remove 32 (142.2 MB) tarball(s). 2025-05-07T20:25:18.5240912Z Will remove 6 (617 KB) package(s). 2025-05-07T20:25:18.5900820Z 2025-05-07T20:25:18.5912265Z + conda clean --all -y 2025-05-07T20:25:18.5912493Z 2025-05-07T20:25:19.3035986Z There are no unused tarball(s) to remove. 2025-05-07T20:25:19.3036585Z Will remove 1 index cache(s). 2025-05-07T20:25:19.3037090Z There are no unused package(s) to remove. 2025-05-07T20:25:19.3037608Z There are no tempfile(s) to remove. 2025-05-07T20:25:19.3038108Z There are no logfile(s) to remove. 2025-05-07T20:25:19.3692366Z 2025-05-07T20:25:19.3707528Z [INSTALL] Installing CUDA 12.8.0 ... 
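[NOTE] The standard-version checks above work by dumping the compiler's predefined macros over an empty translation unit (`-dM -E`) and grepping for the macro that encodes the language standard. A minimal standalone sketch of the same technique, assuming a plain gcc/g++ on PATH rather than the conda-run wrappers used by the workflow (the helper name print_default_standards is illustrative, not part of setup_env.bash):

    #!/usr/bin/env bash
    # Report a toolchain's default C and C++ language standards by dumping
    # its predefined macros over an empty translation unit and grepping.
    print_default_standards() {
      local cc="$1" cxx="$2"
      # __STDC_VERSION__ encodes the C standard; 201710L means C17.
      "$cc" -dM -E - < /dev/null | grep __STDC_VERSION__
      # __cplusplus encodes the C++ standard; 201703L means C++17.
      "$cxx" -dM -E -x c++ - < /dev/null | grep -w __cplusplus
    }

    print_default_standards gcc g++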
2025-05-07T20:25:19.3733362Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.8.0 2025-05-07T20:25:20.3354892Z Channels: 2025-05-07T20:25:20.3355339Z - conda-forge 2025-05-07T20:25:20.3355779Z Platform: linux-64 2025-05-07T20:25:31.2287990Z Collecting package metadata (repodata.json): done 2025-05-07T20:25:32.4177749Z Solving environment: done 2025-05-07T20:25:32.4993604Z 2025-05-07T20:25:32.4994321Z ## Package Plan ## 2025-05-07T20:25:32.4994694Z 2025-05-07T20:25:32.4995036Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:25:32.4995548Z 2025-05-07T20:25:32.4995700Z added / updated specs: 2025-05-07T20:25:32.4996070Z - cuda=12.8.0 2025-05-07T20:25:32.4996282Z 2025-05-07T20:25:32.4996306Z 2025-05-07T20:25:32.4996511Z The following packages will be downloaded: 2025-05-07T20:25:32.4996867Z 2025-05-07T20:25:32.4997070Z package | build 2025-05-07T20:25:32.4997611Z ---------------------------|----------------- 2025-05-07T20:25:32.4998187Z alsa-lib-1.2.14 | hb9d3cd8_0 553 KB conda-forge 2025-05-07T20:25:32.4998779Z attr-2.5.1 | h166bdaf_1 69 KB conda-forge 2025-05-07T20:25:32.4999234Z binutils-2.40 | h4852527_7 31 KB conda-forge 2025-05-07T20:25:32.4999717Z c-compiler-1.5.2 | h0b41bf4_0 6 KB conda-forge 2025-05-07T20:25:32.5000187Z cuda-12.8.0 | ha804496_0 26 KB conda-forge 2025-05-07T20:25:32.5001446Z cuda-cccl_linux-64-12.8.55 | ha770c72_1 1.0 MB conda-forge 2025-05-07T20:25:32.5002220Z cuda-command-line-tools-12.8.0| ha770c72_0 20 KB conda-forge 2025-05-07T20:25:32.5002782Z cuda-compiler-12.8.0 | hbad6d8a_0 20 KB conda-forge 2025-05-07T20:25:32.5003319Z cuda-crt-dev_linux-64-12.8.61| ha770c72_1 90 KB conda-forge 2025-05-07T20:25:32.5003849Z cuda-crt-tools-12.8.61 | ha770c72_1 27 KB conda-forge 2025-05-07T20:25:32.5004360Z cuda-cudart-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:25:32.5004887Z cuda-cudart-dev-12.8.57 | h5888daf_1 23 KB conda-forge 2025-05-07T20:25:32.5005444Z cuda-cudart-dev_linux-64-12.8.57| h3f2d84a_1 377 KB conda-forge 2025-05-07T20:25:32.5006004Z cuda-cudart-static-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:25:32.5006589Z cuda-cudart-static_linux-64-12.8.57| h3f2d84a_1 950 KB conda-forge 2025-05-07T20:25:32.5007175Z cuda-cudart_linux-64-12.8.57| h3f2d84a_1 188 KB conda-forge 2025-05-07T20:25:32.5007717Z cuda-cuobjdump-12.8.55 | hbd13f7d_0 227 KB conda-forge 2025-05-07T20:25:32.5008220Z cuda-cupti-12.8.57 | hbd13f7d_0 1.8 MB conda-forge 2025-05-07T20:25:32.5008721Z cuda-cupti-dev-12.8.57 | h5888daf_0 4.0 MB conda-forge 2025-05-07T20:25:32.5009234Z cuda-cuxxfilt-12.8.55 | hbd13f7d_0 211 KB conda-forge 2025-05-07T20:25:32.5009747Z cuda-driver-dev-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:25:32.5010298Z cuda-driver-dev_linux-64-12.8.90| h3f2d84a_1 36 KB conda-forge 2025-05-07T20:25:32.5010824Z cuda-gdb-12.8.55 | h50b4baa_0 353 KB conda-forge 2025-05-07T20:25:32.5011318Z cuda-libraries-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:32.5011850Z cuda-libraries-dev-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:32.5012376Z cuda-nsight-12.8.55 | h7938cbb_0 113.2 MB conda-forge 2025-05-07T20:25:32.5012863Z cuda-nvcc-12.8.61 | hcdd1206_0 23 KB conda-forge 2025-05-07T20:25:32.5013370Z cuda-nvcc-dev_linux-64-12.8.61| he91c749_1 12.7 MB conda-forge 2025-05-07T20:25:32.5013904Z cuda-nvcc-impl-12.8.61 | h85509e4_1 25 KB conda-forge 2025-05-07T20:25:32.5014421Z cuda-nvcc-tools-12.8.61 | he02047a_1 24.5
MB conda-forge 2025-05-07T20:25:32.5014943Z cuda-nvcc_linux-64-12.8.61 | h04802cd_0 25 KB conda-forge 2025-05-07T20:25:32.5015457Z cuda-nvdisasm-12.8.55 | hbd13f7d_0 4.9 MB conda-forge 2025-05-07T20:25:32.5015972Z cuda-nvml-dev-12.8.55 | hbd13f7d_0 134 KB conda-forge 2025-05-07T20:25:32.5016474Z cuda-nvprof-12.8.57 | hbd13f7d_0 2.5 MB conda-forge 2025-05-07T20:25:32.5016980Z cuda-nvprune-12.8.55 | hbd13f7d_0 68 KB conda-forge 2025-05-07T20:25:32.5017488Z cuda-nvrtc-12.8.61 | hbd13f7d_0 63.1 MB conda-forge 2025-05-07T20:25:32.5017987Z cuda-nvrtc-dev-12.8.61 | h5888daf_0 34 KB conda-forge 2025-05-07T20:25:32.5018481Z cuda-nvtx-12.8.55 | hbd13f7d_0 31 KB conda-forge 2025-05-07T20:25:32.5018990Z cuda-nvvm-dev_linux-64-12.8.61| ha770c72_1 25 KB conda-forge 2025-05-07T20:25:32.5019520Z cuda-nvvm-impl-12.8.61 | he02047a_1 20.8 MB conda-forge 2025-05-07T20:25:32.5020038Z cuda-nvvm-tools-12.8.61 | he02047a_1 23.5 MB conda-forge 2025-05-07T20:25:32.5020535Z cuda-nvvp-12.8.57 | hbd13f7d_0 112.4 MB conda-forge 2025-05-07T20:25:32.5021016Z cuda-opencl-12.8.55 | hbd13f7d_0 29 KB conda-forge 2025-05-07T20:25:32.5021634Z cuda-opencl-dev-12.8.55 | h5888daf_0 95 KB conda-forge 2025-05-07T20:25:32.5022285Z cuda-profiler-api-12.8.55 | h7938cbb_0 22 KB conda-forge 2025-05-07T20:25:32.5022807Z cuda-runtime-12.8.0 | ha804496_0 20 KB conda-forge 2025-05-07T20:25:32.5023336Z cuda-sanitizer-api-12.8.55 | hbd13f7d_0 8.8 MB conda-forge 2025-05-07T20:25:32.5024289Z cuda-toolkit-12.8.0 | ha804496_0 20 KB conda-forge 2025-05-07T20:25:32.5024786Z cuda-tools-12.8.0 | ha770c72_0 19 KB conda-forge 2025-05-07T20:25:32.5025269Z cuda-version-12.8 | h5d125a7_3 21 KB conda-forge 2025-05-07T20:25:32.5025836Z cuda-visual-tools-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:32.5026354Z cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge 2025-05-07T20:25:32.5026814Z dbus-1.13.6 | h5008d03_3 604 KB conda-forge 2025-05-07T20:25:32.5027264Z expat-2.7.0 | h5888daf_0 137 KB conda-forge 2025-05-07T20:25:32.5027795Z font-ttf-dejavu-sans-mono-2.37| hab24e00_0 388 KB conda-forge 2025-05-07T20:25:32.5028377Z font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge 2025-05-07T20:25:32.5028949Z font-ttf-source-code-pro-2.038| h77eed37_0 684 KB conda-forge 2025-05-07T20:25:32.5029502Z font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge 2025-05-07T20:25:32.5030005Z fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge 2025-05-07T20:25:32.5030519Z fonts-conda-ecosystem-1 | 0 4 KB conda-forge 2025-05-07T20:25:32.5031052Z fonts-conda-forge-1 | 0 4 KB conda-forge 2025-05-07T20:25:32.5031547Z freetype-2.13.3 | ha770c72_1 168 KB conda-forge 2025-05-07T20:25:32.5031997Z gcc-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:25:32.5032451Z gds-tools-1.13.0.11 | h5888daf_0 37.9 MB conda-forge 2025-05-07T20:25:32.5032926Z gmp-6.3.0 | hac33072_2 449 KB conda-forge 2025-05-07T20:25:32.5033350Z gxx-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:25:32.5033885Z keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge 2025-05-07T20:25:32.5034340Z krb5-1.21.3 | h659f571_0 1.3 MB conda-forge 2025-05-07T20:25:32.5034787Z libcap-2.71 | h39aace5_0 100 KB conda-forge 2025-05-07T20:25:32.5035253Z libcublas-12.8.3.14 | h9ab20c4_0 460.2 MB conda-forge 2025-05-07T20:25:32.5035810Z libcublas-dev-12.8.3.14 | h9ab20c4_0 89 KB conda-forge 2025-05-07T20:25:32.5036313Z libcufft-11.3.3.41 | hbd13f7d_0 147.4 MB conda-forge 2025-05-07T20:25:32.5036819Z libcufft-dev-11.3.3.41 | h5888daf_0 33 KB conda-forge 2025-05-07T20:25:32.5037318Z libcufile-1.13.0.11 | h12f29b5_0 939 KB 
conda-forge 2025-05-07T20:25:32.5037826Z libcufile-dev-1.13.0.11 | h5888daf_0 35 KB conda-forge 2025-05-07T20:25:32.5038334Z libcurand-10.3.9.55 | hbd13f7d_0 43.6 MB conda-forge 2025-05-07T20:25:32.5038833Z libcurand-dev-10.3.9.55 | h5888daf_0 265 KB conda-forge 2025-05-07T20:25:32.5039350Z libcusolver-11.7.2.55 | h9ab20c4_0 156.9 MB conda-forge 2025-05-07T20:25:32.5039878Z libcusolver-dev-11.7.2.55 | h9ab20c4_0 59 KB conda-forge 2025-05-07T20:25:32.5040403Z libcusparse-12.5.7.53 | hbd13f7d_0 164.9 MB conda-forge 2025-05-07T20:25:32.5040925Z libcusparse-dev-12.5.7.53 | h5888daf_0 51 KB conda-forge 2025-05-07T20:25:32.5041450Z libedit-3.1.20191231 | he28a2e2_2 121 KB conda-forge 2025-05-07T20:25:32.5042099Z libexpat-2.7.0 | h5888daf_0 73 KB conda-forge 2025-05-07T20:25:32.5042705Z libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge 2025-05-07T20:25:32.5043212Z libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge 2025-05-07T20:25:32.5043721Z libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge 2025-05-07T20:25:32.5044210Z libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge 2025-05-07T20:25:32.5044670Z libglvnd-1.7.0 | ha4b6fd6_2 129 KB conda-forge 2025-05-07T20:25:32.5045155Z libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge 2025-05-07T20:25:32.5045639Z libiconv-1.18 | h4ce23a2_1 696 KB conda-forge 2025-05-07T20:25:32.5046094Z libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge 2025-05-07T20:25:32.5046554Z libnpp-12.3.3.65 | hbd13f7d_0 130.6 MB conda-forge 2025-05-07T20:25:32.5047042Z libnpp-dev-12.3.3.65 | h5888daf_0 443 KB conda-forge 2025-05-07T20:25:32.5047515Z libnsl-2.0.1 | hd590300_0 33 KB conda-forge 2025-05-07T20:25:32.5047965Z libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge 2025-05-07T20:25:32.5048450Z libnvfatbin-12.8.55 | hbd13f7d_0 793 KB conda-forge 2025-05-07T20:25:32.5048969Z libnvfatbin-dev-12.8.55 | h5888daf_0 26 KB conda-forge 2025-05-07T20:25:32.5049493Z libnvjitlink-12.8.61 | hbd13f7d_0 28.7 MB conda-forge 2025-05-07T20:25:32.5050014Z libnvjitlink-dev-12.8.61 | h5888daf_0 25 KB conda-forge 2025-05-07T20:25:32.5050532Z libnvjpeg-12.3.5.57 | h97fd463_0 3.0 MB conda-forge 2025-05-07T20:25:32.5051037Z libnvjpeg-dev-12.3.5.57 | ha770c72_0 31 KB conda-forge 2025-05-07T20:25:32.5051538Z libopengl-1.7.0 | ha4b6fd6_2 50 KB conda-forge 2025-05-07T20:25:32.5052019Z libpng-1.6.47 | h943b412_0 282 KB conda-forge 2025-05-07T20:25:32.5052487Z libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge 2025-05-07T20:25:32.5052975Z libsystemd0-256.9 | h2774228_0 401 KB conda-forge 2025-05-07T20:25:32.5053456Z libudev1-257.4 | h9a4d06a_0 140 KB conda-forge 2025-05-07T20:25:32.5053926Z libuuid-2.38.1 | h0b41bf4_0 33 KB conda-forge 2025-05-07T20:25:32.5054382Z libxcb-1.17.0 | h8a09558_0 387 KB conda-forge 2025-05-07T20:25:32.5054855Z libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge 2025-05-07T20:25:32.5055355Z libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge 2025-05-07T20:25:32.5055831Z libxml2-2.13.5 | h064dc61_0 673 KB conda-forge 2025-05-07T20:25:32.5056296Z libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge 2025-05-07T20:25:32.5056742Z lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge 2025-05-07T20:25:32.5057241Z nsight-compute-2025.1.0.14 | hb5ebaad_0 320.6 MB conda-forge 2025-05-07T20:25:32.5057738Z nspr-4.36 | h5888daf_0 225 KB conda-forge 2025-05-07T20:25:32.5058170Z nss-3.111 | h159eef7_0 1.9 MB conda-forge 2025-05-07T20:25:32.5058607Z ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge 2025-05-07T20:25:32.5059110Z opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge 2025-05-07T20:25:32.5059607Z pcre2-10.44 | hc749103_2 
934 KB conda-forge 2025-05-07T20:25:32.5060083Z pthread-stubs-0.4 | hb9d3cd8_1002 8 KB conda-forge 2025-05-07T20:25:32.5060695Z python-3.10.13 |hd12c33a_1_cpython 24.5 MB conda-forge 2025-05-07T20:25:32.5061303Z rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge 2025-05-07T20:25:32.5061770Z sqlite-3.32.3 | hcee41ef_1 1.4 MB conda-forge 2025-05-07T20:25:32.5062213Z tk-8.6.13 |noxft_h4845f30_101 3.2 MB conda-forge 2025-05-07T20:25:32.5062879Z wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge 2025-05-07T20:25:32.5063342Z xcb-util-0.4.1 | hb711507_2 19 KB conda-forge 2025-05-07T20:25:32.5063828Z xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge 2025-05-07T20:25:32.5064345Z xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge 2025-05-07T20:25:32.5064863Z xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge 2025-05-07T20:25:32.5065405Z xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge 2025-05-07T20:25:32.5065923Z xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge 2025-05-07T20:25:32.5066444Z xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge 2025-05-07T20:25:32.5066957Z xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge 2025-05-07T20:25:32.5067435Z xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge 2025-05-07T20:25:32.5067920Z xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge 2025-05-07T20:25:32.5068409Z xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge 2025-05-07T20:25:32.5068934Z xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge 2025-05-07T20:25:32.5069473Z xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge 2025-05-07T20:25:32.5069992Z xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:25:32.5070501Z xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge 2025-05-07T20:25:32.5071018Z xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:25:32.5071511Z xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge 2025-05-07T20:25:32.5072010Z xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge 2025-05-07T20:25:32.5072537Z xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge 2025-05-07T20:25:32.5073045Z xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge 2025-05-07T20:25:32.5073594Z zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge 2025-05-07T20:25:32.5074030Z zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge 2025-05-07T20:25:32.5074456Z ------------------------------------------------------------ 2025-05-07T20:25:32.5074838Z Total: 1.90 GB 2025-05-07T20:25:32.5075087Z 2025-05-07T20:25:32.5075235Z The following NEW packages will be INSTALLED: 2025-05-07T20:25:32.5075484Z 2025-05-07T20:25:32.5075722Z alsa-lib conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0 2025-05-07T20:25:32.5076197Z attr conda-forge/linux-64::attr-2.5.1-h166bdaf_1 2025-05-07T20:25:32.5076670Z binutils conda-forge/linux-64::binutils-2.40-h4852527_7 2025-05-07T20:25:32.5077195Z c-compiler conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0 2025-05-07T20:25:32.5077685Z cuda conda-forge/noarch::cuda-12.8.0-ha804496_0 2025-05-07T20:25:32.5078212Z cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.8.55-ha770c72_1 2025-05-07T20:25:32.5078892Z cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.8.0-ha770c72_0 2025-05-07T20:25:32.5079544Z cuda-compiler conda-forge/noarch::cuda-compiler-12.8.0-hbad6d8a_0 2025-05-07T20:25:32.5080161Z cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.8.61-ha770c72_1 2025-05-07T20:25:32.5080888Z cuda-crt-tools conda-forge/linux-64::cuda-crt-tools-12.8.61-ha770c72_1 2025-05-07T20:25:32.5081563Z cuda-cudart conda-forge/linux-64::cuda-cudart-12.8.57-h5888daf_1 
2025-05-07T20:25:32.5082153Z cuda-cudart-dev conda-forge/linux-64::cuda-cudart-dev-12.8.57-h5888daf_1 2025-05-07T20:25:32.5082802Z cuda-cudart-dev_l~ conda-forge/noarch::cuda-cudart-dev_linux-64-12.8.57-h3f2d84a_1 2025-05-07T20:25:32.5083633Z cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.8.57-h5888daf_1 2025-05-07T20:25:32.5084334Z cuda-cudart-stati~ conda-forge/noarch::cuda-cudart-static_linux-64-12.8.57-h3f2d84a_1 2025-05-07T20:25:32.5085017Z cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.8.57-h3f2d84a_1 2025-05-07T20:25:32.5085649Z cuda-cuobjdump conda-forge/linux-64::cuda-cuobjdump-12.8.55-hbd13f7d_0 2025-05-07T20:25:32.5086231Z cuda-cupti conda-forge/linux-64::cuda-cupti-12.8.57-hbd13f7d_0 2025-05-07T20:25:32.5086812Z cuda-cupti-dev conda-forge/linux-64::cuda-cupti-dev-12.8.57-h5888daf_0 2025-05-07T20:25:32.5087417Z cuda-cuxxfilt conda-forge/linux-64::cuda-cuxxfilt-12.8.55-hbd13f7d_0 2025-05-07T20:25:32.5088029Z cuda-driver-dev conda-forge/linux-64::cuda-driver-dev-12.8.57-h5888daf_1 2025-05-07T20:25:32.5088671Z cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.8.90-h3f2d84a_1 2025-05-07T20:25:32.5089272Z cuda-gdb conda-forge/linux-64::cuda-gdb-12.8.55-h50b4baa_0 2025-05-07T20:25:32.5089827Z cuda-libraries conda-forge/linux-64::cuda-libraries-12.8.0-ha770c72_0 2025-05-07T20:25:32.5090468Z cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.8.0-ha770c72_0 2025-05-07T20:25:32.5091079Z cuda-nsight conda-forge/linux-64::cuda-nsight-12.8.55-h7938cbb_0 2025-05-07T20:25:32.5091622Z cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.8.61-hcdd1206_0 2025-05-07T20:25:32.5092210Z cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.8.61-he91c749_1 2025-05-07T20:25:32.5092855Z cuda-nvcc-impl conda-forge/linux-64::cuda-nvcc-impl-12.8.61-h85509e4_1 2025-05-07T20:25:32.5093462Z cuda-nvcc-tools conda-forge/linux-64::cuda-nvcc-tools-12.8.61-he02047a_1 2025-05-07T20:25:32.5094086Z cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.8.61-h04802cd_0 2025-05-07T20:25:32.5094700Z cuda-nvdisasm conda-forge/linux-64::cuda-nvdisasm-12.8.55-hbd13f7d_0 2025-05-07T20:25:32.5095286Z cuda-nvml-dev conda-forge/linux-64::cuda-nvml-dev-12.8.55-hbd13f7d_0 2025-05-07T20:25:32.5095858Z cuda-nvprof conda-forge/linux-64::cuda-nvprof-12.8.57-hbd13f7d_0 2025-05-07T20:25:32.5096428Z cuda-nvprune conda-forge/linux-64::cuda-nvprune-12.8.55-hbd13f7d_0 2025-05-07T20:25:32.5096989Z cuda-nvrtc conda-forge/linux-64::cuda-nvrtc-12.8.61-hbd13f7d_0 2025-05-07T20:25:32.5097551Z cuda-nvrtc-dev conda-forge/linux-64::cuda-nvrtc-dev-12.8.61-h5888daf_0 2025-05-07T20:25:32.5098121Z cuda-nvtx conda-forge/linux-64::cuda-nvtx-12.8.55-hbd13f7d_0 2025-05-07T20:25:32.5098712Z cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.8.61-ha770c72_1 2025-05-07T20:25:32.5099344Z cuda-nvvm-impl conda-forge/linux-64::cuda-nvvm-impl-12.8.61-he02047a_1 2025-05-07T20:25:32.5099950Z cuda-nvvm-tools conda-forge/linux-64::cuda-nvvm-tools-12.8.61-he02047a_1 2025-05-07T20:25:32.5100526Z cuda-nvvp conda-forge/linux-64::cuda-nvvp-12.8.57-hbd13f7d_0 2025-05-07T20:25:32.5101068Z cuda-opencl conda-forge/linux-64::cuda-opencl-12.8.55-hbd13f7d_0 2025-05-07T20:25:32.5101656Z cuda-opencl-dev conda-forge/linux-64::cuda-opencl-dev-12.8.55-h5888daf_0 2025-05-07T20:25:32.5102298Z cuda-profiler-api conda-forge/linux-64::cuda-profiler-api-12.8.55-h7938cbb_0 2025-05-07T20:25:32.5102914Z cuda-runtime conda-forge/noarch::cuda-runtime-12.8.0-ha804496_0 2025-05-07T20:25:32.5103535Z 
cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.8.55-hbd13f7d_0 2025-05-07T20:25:32.5104258Z cuda-toolkit conda-forge/noarch::cuda-toolkit-12.8.0-ha804496_0 2025-05-07T20:25:32.5104888Z cuda-tools conda-forge/linux-64::cuda-tools-12.8.0-ha770c72_0 2025-05-07T20:25:32.5105431Z cuda-version conda-forge/noarch::cuda-version-12.8-h5d125a7_3 2025-05-07T20:25:32.5106028Z cuda-visual-tools conda-forge/linux-64::cuda-visual-tools-12.8.0-ha770c72_0 2025-05-07T20:25:32.5106645Z cxx-compiler conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0 2025-05-07T20:25:32.5107147Z dbus conda-forge/linux-64::dbus-1.13.6-h5008d03_3 2025-05-07T20:25:32.5107602Z expat conda-forge/linux-64::expat-2.7.0-h5888daf_0 2025-05-07T20:25:32.5108186Z font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0 2025-05-07T20:25:32.5108864Z font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0 2025-05-07T20:25:32.5109543Z font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0 2025-05-07T20:25:32.5110196Z font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3 2025-05-07T20:25:32.5110771Z fontconfig conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1 2025-05-07T20:25:32.5111331Z fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0 2025-05-07T20:25:32.5111891Z fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0 2025-05-07T20:25:32.5112418Z freetype conda-forge/linux-64::freetype-2.13.3-ha770c72_1 2025-05-07T20:25:32.5112898Z gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13 2025-05-07T20:25:32.5113374Z gds-tools conda-forge/linux-64::gds-tools-1.13.0.11-h5888daf_0 2025-05-07T20:25:32.5113975Z gmp conda-forge/linux-64::gmp-6.3.0-hac33072_2 2025-05-07T20:25:32.5114408Z gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13 2025-05-07T20:25:32.5114881Z keyutils conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0 2025-05-07T20:25:32.5115356Z krb5 conda-forge/linux-64::krb5-1.21.3-h659f571_0 2025-05-07T20:25:32.5115818Z libcap conda-forge/linux-64::libcap-2.71-h39aace5_0 2025-05-07T20:25:32.5116329Z libcublas conda-forge/linux-64::libcublas-12.8.3.14-h9ab20c4_0 2025-05-07T20:25:32.5116905Z libcublas-dev conda-forge/linux-64::libcublas-dev-12.8.3.14-h9ab20c4_0 2025-05-07T20:25:32.5117482Z libcufft conda-forge/linux-64::libcufft-11.3.3.41-hbd13f7d_0 2025-05-07T20:25:32.5118046Z libcufft-dev conda-forge/linux-64::libcufft-dev-11.3.3.41-h5888daf_0 2025-05-07T20:25:32.5118613Z libcufile conda-forge/linux-64::libcufile-1.13.0.11-h12f29b5_0 2025-05-07T20:25:32.5119177Z libcufile-dev conda-forge/linux-64::libcufile-dev-1.13.0.11-h5888daf_0 2025-05-07T20:25:32.5119754Z libcurand conda-forge/linux-64::libcurand-10.3.9.55-hbd13f7d_0 2025-05-07T20:25:32.5120325Z libcurand-dev conda-forge/linux-64::libcurand-dev-10.3.9.55-h5888daf_0 2025-05-07T20:25:32.5120920Z libcusolver conda-forge/linux-64::libcusolver-11.7.2.55-h9ab20c4_0 2025-05-07T20:25:32.5121531Z libcusolver-dev conda-forge/linux-64::libcusolver-dev-11.7.2.55-h9ab20c4_0 2025-05-07T20:25:32.5122145Z libcusparse conda-forge/linux-64::libcusparse-12.5.7.53-hbd13f7d_0 2025-05-07T20:25:32.5122758Z libcusparse-dev conda-forge/linux-64::libcusparse-dev-12.5.7.53-h5888daf_0 2025-05-07T20:25:32.5123347Z libedit conda-forge/linux-64::libedit-3.1.20191231-he28a2e2_2 2025-05-07T20:25:32.5124259Z libexpat conda-forge/linux-64::libexpat-2.7.0-h5888daf_0 2025-05-07T20:25:32.5124868Z libfreetype conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1 2025-05-07T20:25:32.5125440Z libfreetype6 
conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1 2025-05-07T20:25:32.5126024Z libgcrypt-lib conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2 2025-05-07T20:25:32.5126765Z libglib conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0 2025-05-07T20:25:32.5127266Z libglvnd conda-forge/linux-64::libglvnd-1.7.0-ha4b6fd6_2 2025-05-07T20:25:32.5127933Z libgpg-error conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0 2025-05-07T20:25:32.5128469Z libiconv conda-forge/linux-64::libiconv-1.18-h4ce23a2_1 2025-05-07T20:25:32.5128951Z libnl conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0 2025-05-07T20:25:32.5129432Z libnpp conda-forge/linux-64::libnpp-12.3.3.65-hbd13f7d_0 2025-05-07T20:25:32.5129957Z libnpp-dev conda-forge/linux-64::libnpp-dev-12.3.3.65-h5888daf_0 2025-05-07T20:25:32.5130464Z libnsl conda-forge/linux-64::libnsl-2.0.1-hd590300_0 2025-05-07T20:25:32.5130948Z libnuma conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2 2025-05-07T20:25:32.5131477Z libnvfatbin conda-forge/linux-64::libnvfatbin-12.8.55-hbd13f7d_0 2025-05-07T20:25:32.5132077Z libnvfatbin-dev conda-forge/linux-64::libnvfatbin-dev-12.8.55-h5888daf_0 2025-05-07T20:25:32.5132686Z libnvjitlink conda-forge/linux-64::libnvjitlink-12.8.61-hbd13f7d_0 2025-05-07T20:25:32.5133308Z libnvjitlink-dev conda-forge/linux-64::libnvjitlink-dev-12.8.61-h5888daf_0 2025-05-07T20:25:32.5133906Z libnvjpeg conda-forge/linux-64::libnvjpeg-12.3.5.57-h97fd463_0 2025-05-07T20:25:32.5134484Z libnvjpeg-dev conda-forge/linux-64::libnvjpeg-dev-12.3.5.57-ha770c72_0 2025-05-07T20:25:32.5135060Z libopengl conda-forge/linux-64::libopengl-1.7.0-ha4b6fd6_2 2025-05-07T20:25:32.5135562Z libpng conda-forge/linux-64::libpng-1.6.47-h943b412_0 2025-05-07T20:25:32.5136066Z libsqlite conda-forge/linux-64::libsqlite-3.49.2-hee588c1_0 2025-05-07T20:25:32.5136605Z libsystemd0 conda-forge/linux-64::libsystemd0-256.9-h2774228_0 2025-05-07T20:25:32.5137127Z libudev1 conda-forge/linux-64::libudev1-257.4-h9a4d06a_0 2025-05-07T20:25:32.5137619Z libxcb conda-forge/linux-64::libxcb-1.17.0-h8a09558_0 2025-05-07T20:25:32.5138149Z libxkbcommon conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0 2025-05-07T20:25:32.5138711Z libxkbfile conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1 2025-05-07T20:25:32.5139219Z libxml2 conda-forge/linux-64::libxml2-2.13.5-h064dc61_0 2025-05-07T20:25:32.5139708Z libzlib conda-forge/linux-64::libzlib-1.3.1-hb9d3cd8_2 2025-05-07T20:25:32.5140180Z lz4-c conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0 2025-05-07T20:25:32.5140727Z nsight-compute conda-forge/linux-64::nsight-compute-2025.1.0.14-hb5ebaad_0 2025-05-07T20:25:32.5152825Z nspr conda-forge/linux-64::nspr-4.36-h5888daf_0 2025-05-07T20:25:32.5153284Z nss conda-forge/linux-64::nss-3.111-h159eef7_0 2025-05-07T20:25:32.5153817Z ocl-icd conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0 2025-05-07T20:25:32.5154391Z opencl-headers conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0 2025-05-07T20:25:32.5154955Z pcre2 conda-forge/linux-64::pcre2-10.44-hc749103_2 2025-05-07T20:25:32.5155518Z pthread-stubs conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002 2025-05-07T20:25:32.5156106Z rdma-core conda-forge/linux-64::rdma-core-55.0-h5888daf_0 2025-05-07T20:25:32.5156600Z wayland conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0 2025-05-07T20:25:32.5157091Z xcb-util conda-forge/linux-64::xcb-util-0.4.1-hb711507_2 2025-05-07T20:25:32.5157647Z xcb-util-cursor conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0 2025-05-07T20:25:32.5158252Z xcb-util-image conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2 
2025-05-07T20:25:32.5158852Z xcb-util-keysyms conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0 2025-05-07T20:25:32.5159505Z xcb-util-renderut~ conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0 2025-05-07T20:25:32.5160111Z xcb-util-wm conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0 2025-05-07T20:25:32.5160829Z xkeyboard-config conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0 2025-05-07T20:25:32.5161517Z xorg-libice conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0 2025-05-07T20:25:32.5162070Z xorg-libsm conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0 2025-05-07T20:25:32.5162612Z xorg-libx11 conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0 2025-05-07T20:25:32.5163165Z xorg-libxau conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0 2025-05-07T20:25:32.5163774Z xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2 2025-05-07T20:25:32.5164425Z xorg-libxdamage conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0 2025-05-07T20:25:32.5165032Z xorg-libxdmcp conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0 2025-05-07T20:25:32.5165607Z xorg-libxext conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0 2025-05-07T20:25:32.5166182Z xorg-libxfixes conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0 2025-05-07T20:25:32.5166762Z xorg-libxi conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0 2025-05-07T20:25:32.5167333Z xorg-libxrandr conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0 2025-05-07T20:25:32.5167949Z xorg-libxrender conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0 2025-05-07T20:25:32.5168540Z xorg-libxtst conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3 2025-05-07T20:25:32.5169042Z zstd conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2 2025-05-07T20:25:32.5169459Z The following packages will be UPDATED: 2025-05-07T20:25:32.5170005Z libuuid pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0 2025-05-07T20:25:32.5170674Z zlib pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2 2025-05-07T20:25:32.5171301Z The following packages will be SUPERSEDED by a higher-priority channel: 2025-05-07T20:25:32.5171980Z python pkgs/main::python-3.10.16-he870216_1 --> conda-forge::python-3.10.13-hd12c33a_1_cpython 2025-05-07T20:25:32.5172687Z sqlite pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1 2025-05-07T20:25:32.5173326Z tk pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101 2025-05-07T20:25:32.5173887Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:32.5174316Z [conda progress-bar frames elided: parallel downloads of libcublas-12.8.3.14 (460.2 MB), nsight-compute-2025.1.0.14 (320.6 MB), libcusparse-12.5.7.53 (164.9 MB), libcusolver-11.7.2.55 (156.9 MB), libcufft-11.3.3.41 (147.4 MB), libnpp-12.3.3.65 (130.6 MB), cuda-nsight-12.8.55 (113.2 MB), cuda-nvvp-12.8.57 (112.4 MB), cuda-nvrtc-12.8.61 (63.1 MB), and smaller packages advance from 0%; in the frames captured here libcufft reaches ~61%, libcusparse ~56%, libcusolver ~47%, nsight-compute ~24%, and libcublas ~13%]
| 320.6 MB | ##5 | 25%  2025-05-07T20:25:35.1724377Z 2025-05-07T20:25:35.1724382Z 2025-05-07T20:25:35.1726506Z 2025-05-07T20:25:35.2036553Z libcusolver-11.7.2.5 | 156.9 MB | ####9 | 49%  2025-05-07T20:25:35.2036844Z 2025-05-07T20:25:35.2036848Z 2025-05-07T20:25:35.2072703Z libcusparse-12.5.7.5 | 164.9 MB | #####8 | 58%  2025-05-07T20:25:35.2072988Z 2025-05-07T20:25:35.2072991Z 2025-05-07T20:25:35.2072995Z 2025-05-07T20:25:35.2072999Z 2025-05-07T20:25:35.2663253Z libcufft-11.3.3.41 | 147.4 MB | ######3 | 63%  2025-05-07T20:25:35.2666942Z 2025-05-07T20:25:35.2728018Z nsight-compute-2025. | 320.6 MB | ##6 | 26%  2025-05-07T20:25:35.2728557Z 2025-05-07T20:25:35.2728561Z 2025-05-07T20:25:35.2728565Z 2025-05-07T20:25:35.3036517Z libcusolver-11.7.2.5 | 156.9 MB | #####1 | 51%  2025-05-07T20:25:35.3036826Z 2025-05-07T20:25:35.3036831Z 2025-05-07T20:25:35.3074350Z libcusparse-12.5.7.5 | 164.9 MB | ###### | 61%  2025-05-07T20:25:35.3074779Z 2025-05-07T20:25:35.3074783Z 2025-05-07T20:25:35.3074787Z 2025-05-07T20:25:35.3074791Z 2025-05-07T20:25:35.3241298Z libcufft-11.3.3.41 | 147.4 MB | ######5 | 66%  2025-05-07T20:25:35.3663559Z libcublas-12.8.3.14 | 460.2 MB | #3 | 14% 2025-05-07T20:25:35.3664900Z 2025-05-07T20:25:35.3768894Z nsight-compute-2025. | 320.6 MB | ##7 | 28%  2025-05-07T20:25:35.3769179Z 2025-05-07T20:25:35.3769191Z 2025-05-07T20:25:35.3769195Z 2025-05-07T20:25:35.4088259Z libcusolver-11.7.2.5 | 156.9 MB | #####3 | 53%  2025-05-07T20:25:35.4088567Z 2025-05-07T20:25:35.4088571Z 2025-05-07T20:25:35.4110575Z libcusparse-12.5.7.5 | 164.9 MB | ######3 | 63%  2025-05-07T20:25:35.4110984Z 2025-05-07T20:25:35.4110990Z 2025-05-07T20:25:35.4110995Z 2025-05-07T20:25:35.4113322Z 2025-05-07T20:25:35.4242992Z libcufft-11.3.3.41 | 147.4 MB | ######8 | 68%  2025-05-07T20:25:35.4666301Z libcublas-12.8.3.14 | 460.2 MB | #4 | 15% 2025-05-07T20:25:35.4666703Z 2025-05-07T20:25:35.4772172Z nsight-compute-2025. | 320.6 MB | ##8 | 29%  2025-05-07T20:25:35.4772464Z 2025-05-07T20:25:35.4772471Z 2025-05-07T20:25:35.5089484Z 2025-05-07T20:25:35.5090022Z libcusolver-11.7.2.5 | 156.9 MB | #####5 | 56%  2025-05-07T20:25:35.5090423Z 2025-05-07T20:25:35.5091808Z 2025-05-07T20:25:35.5112703Z libcusparse-12.5.7.5 | 164.9 MB | ######5 | 66%  2025-05-07T20:25:35.5113139Z 2025-05-07T20:25:35.5113146Z 2025-05-07T20:25:35.5113152Z 2025-05-07T20:25:35.5113739Z 2025-05-07T20:25:35.5705090Z libcufft-11.3.3.41 | 147.4 MB | #######1 | 71%  2025-05-07T20:25:35.5706096Z 2025-05-07T20:25:35.5772166Z nsight-compute-2025. | 320.6 MB | ### | 30%  2025-05-07T20:25:35.5772572Z 2025-05-07T20:25:35.5772578Z 2025-05-07T20:25:35.5772584Z 2025-05-07T20:25:35.5821413Z libcusolver-11.7.2.5 | 156.9 MB | #####7 | 58%  2025-05-07T20:25:35.6094471Z libcublas-12.8.3.14 | 460.2 MB | #5 | 15% 2025-05-07T20:25:35.6094797Z 2025-05-07T20:25:35.6097011Z 2025-05-07T20:25:35.6114384Z libcusparse-12.5.7.5 | 164.9 MB | ######8 | 68%  2025-05-07T20:25:35.6114761Z 2025-05-07T20:25:35.6114766Z 2025-05-07T20:25:35.6114770Z 2025-05-07T20:25:35.6118237Z 2025-05-07T20:25:35.6773903Z libcufft-11.3.3.41 | 147.4 MB | #######3 | 74%  2025-05-07T20:25:35.6774203Z 2025-05-07T20:25:35.6774207Z 2025-05-07T20:25:35.6775198Z 2025-05-07T20:25:35.6822282Z libcusolver-11.7.2.5 | 156.9 MB | ###### | 60%  2025-05-07T20:25:35.6992027Z libcublas-12.8.3.14 | 460.2 MB | #6 | 16% 2025-05-07T20:25:35.6992774Z 2025-05-07T20:25:35.7258724Z nsight-compute-2025. 
| 320.6 MB | ###1 | 31%  2025-05-07T20:25:35.7259022Z 2025-05-07T20:25:35.7259026Z 2025-05-07T20:25:35.7290206Z libcusparse-12.5.7.5 | 164.9 MB | ####### | 71%  2025-05-07T20:25:35.7290500Z 2025-05-07T20:25:35.7290504Z 2025-05-07T20:25:35.7290508Z 2025-05-07T20:25:35.7291443Z 2025-05-07T20:25:35.7774565Z libcufft-11.3.3.41 | 147.4 MB | #######6 | 76%  2025-05-07T20:25:35.7774862Z 2025-05-07T20:25:35.7774866Z 2025-05-07T20:25:35.7775388Z 2025-05-07T20:25:35.7824933Z libcusolver-11.7.2.5 | 156.9 MB | ######2 | 62%  2025-05-07T20:25:35.7993080Z libcublas-12.8.3.14 | 460.2 MB | #6 | 17% 2025-05-07T20:25:35.7995809Z 2025-05-07T20:25:35.8323018Z nsight-compute-2025. | 320.6 MB | ###2 | 32%  2025-05-07T20:25:35.8323313Z 2025-05-07T20:25:35.8323317Z 2025-05-07T20:25:35.8349972Z libcusparse-12.5.7.5 | 164.9 MB | #######2 | 73%  2025-05-07T20:25:35.8350370Z 2025-05-07T20:25:35.8350402Z 2025-05-07T20:25:35.8350406Z 2025-05-07T20:25:35.8352859Z 2025-05-07T20:25:35.8774868Z libcufft-11.3.3.41 | 147.4 MB | #######9 | 79%  2025-05-07T20:25:35.8775277Z 2025-05-07T20:25:35.8775281Z 2025-05-07T20:25:35.8776014Z 2025-05-07T20:25:35.8994755Z libcusolver-11.7.2.5 | 156.9 MB | ######5 | 65%  2025-05-07T20:25:35.8995463Z 2025-05-07T20:25:35.9325366Z nsight-compute-2025. | 320.6 MB | ###3 | 34%  2025-05-07T20:25:35.9325657Z 2025-05-07T20:25:35.9325776Z 2025-05-07T20:25:35.9352614Z libcusparse-12.5.7.5 | 164.9 MB | #######5 | 75%  2025-05-07T20:25:35.9353041Z 2025-05-07T20:25:35.9353047Z 2025-05-07T20:25:35.9353052Z 2025-05-07T20:25:35.9354624Z 2025-05-07T20:25:35.9778565Z libcufft-11.3.3.41 | 147.4 MB | ########1 | 82%  2025-05-07T20:25:35.9779424Z 2025-05-07T20:25:35.9779431Z 2025-05-07T20:25:35.9782043Z 2025-05-07T20:25:35.9998573Z libcusolver-11.7.2.5 | 156.9 MB | ######8 | 68%  2025-05-07T20:25:35.9999676Z 2025-05-07T20:25:36.0223960Z nsight-compute-2025. | 320.6 MB | ###5 | 35%  2025-05-07T20:25:36.0327952Z libcublas-12.8.3.14 | 460.2 MB | #7 | 18% 2025-05-07T20:25:36.0328242Z 2025-05-07T20:25:36.0328246Z 2025-05-07T20:25:36.0357375Z libcusparse-12.5.7.5 | 164.9 MB | #######7 | 78%  2025-05-07T20:25:36.0357921Z 2025-05-07T20:25:36.0358040Z 2025-05-07T20:25:36.0358048Z 2025-05-07T20:25:36.0358148Z 2025-05-07T20:25:36.0896874Z libcufft-11.3.3.41 | 147.4 MB | ########4 | 85%  2025-05-07T20:25:36.0897176Z 2025-05-07T20:25:36.0897180Z 2025-05-07T20:25:36.0897725Z 2025-05-07T20:25:36.1071933Z libcusolver-11.7.2.5 | 156.9 MB | ####### | 71%  2025-05-07T20:25:36.1072227Z 2025-05-07T20:25:36.1417043Z nsight-compute-2025. | 320.6 MB | ###6 | 36%  2025-05-07T20:25:36.1417332Z 2025-05-07T20:25:36.1417337Z 2025-05-07T20:25:36.1417341Z 2025-05-07T20:25:36.1417809Z 2025-05-07T20:25:36.1433621Z libcufft-11.3.3.41 | 147.4 MB | ########7 | 87%  2025-05-07T20:25:36.1433954Z 2025-05-07T20:25:36.1433958Z 2025-05-07T20:25:36.1575144Z libcusparse-12.5.7.5 | 164.9 MB | ######## | 80%  2025-05-07T20:25:36.2073751Z libcublas-12.8.3.14 | 460.2 MB | #8 | 18% 2025-05-07T20:25:36.2075437Z 2025-05-07T20:25:36.2244206Z nsight-compute-2025. 
| 320.6 MB | ###7 | 38%  2025-05-07T20:25:36.2244522Z 2025-05-07T20:25:36.2244528Z 2025-05-07T20:25:36.2247304Z 2025-05-07T20:25:36.2419133Z libcusolver-11.7.2.5 | 156.9 MB | #######3 | 73%  2025-05-07T20:25:36.2419434Z 2025-05-07T20:25:36.2419439Z 2025-05-07T20:25:36.2419443Z 2025-05-07T20:25:36.2419958Z 2025-05-07T20:25:36.2434115Z libcufft-11.3.3.41 | 147.4 MB | ########9 | 90%  2025-05-07T20:25:36.2434423Z 2025-05-07T20:25:36.2434427Z 2025-05-07T20:25:36.2577958Z libcusparse-12.5.7.5 | 164.9 MB | ########2 | 83%  2025-05-07T20:25:36.3167358Z libcublas-12.8.3.14 | 460.2 MB | #8 | 19% 2025-05-07T20:25:36.3167950Z 2025-05-07T20:25:36.3244149Z nsight-compute-2025. | 320.6 MB | ###8 | 39%  2025-05-07T20:25:36.3244445Z 2025-05-07T20:25:36.3244450Z 2025-05-07T20:25:36.3244747Z 2025-05-07T20:25:36.3420394Z libcusolver-11.7.2.5 | 156.9 MB | #######5 | 76%  2025-05-07T20:25:36.3420693Z 2025-05-07T20:25:36.3420697Z 2025-05-07T20:25:36.3420701Z 2025-05-07T20:25:36.3421232Z 2025-05-07T20:25:36.3498135Z libcufft-11.3.3.41 | 147.4 MB | #########2 | 92%  2025-05-07T20:25:36.3498440Z 2025-05-07T20:25:36.3498903Z 2025-05-07T20:25:36.4026960Z libcusparse-12.5.7.5 | 164.9 MB | ########4 | 85%  2025-05-07T20:25:36.4249134Z libcublas-12.8.3.14 | 460.2 MB | #9 | 20% 2025-05-07T20:25:36.4249420Z 2025-05-07T20:25:36.4249424Z 2025-05-07T20:25:36.4252820Z 2025-05-07T20:25:36.4259004Z libcusolver-11.7.2.5 | 156.9 MB | #######7 | 78%  2025-05-07T20:25:36.4259613Z 2025-05-07T20:25:36.4481404Z nsight-compute-2025. | 320.6 MB | ###9 | 40%  2025-05-07T20:25:36.4481713Z 2025-05-07T20:25:36.4481717Z 2025-05-07T20:25:36.4481721Z 2025-05-07T20:25:36.4483863Z 2025-05-07T20:25:36.4528365Z libcufft-11.3.3.41 | 147.4 MB | #########5 | 95%  2025-05-07T20:25:36.4528657Z 2025-05-07T20:25:36.4528661Z 2025-05-07T20:25:36.5028299Z libcusparse-12.5.7.5 | 164.9 MB | ########7 | 87%  2025-05-07T20:25:36.5252650Z libcublas-12.8.3.14 | 460.2 MB | ## | 20% 2025-05-07T20:25:36.5252933Z 2025-05-07T20:25:36.5252937Z 2025-05-07T20:25:36.5256346Z 2025-05-07T20:25:36.5365787Z libcusolver-11.7.2.5 | 156.9 MB | ######## | 80%  2025-05-07T20:25:36.5367748Z 2025-05-07T20:25:36.5483238Z nsight-compute-2025. | 320.6 MB | ####1 | 41%  2025-05-07T20:25:36.5483545Z 2025-05-07T20:25:36.5483549Z 2025-05-07T20:25:36.5483553Z 2025-05-07T20:25:36.5484090Z 2025-05-07T20:25:36.5535627Z libcufft-11.3.3.41 | 147.4 MB | #########7 | 98%  2025-05-07T20:25:36.5536032Z 2025-05-07T20:25:36.5536072Z 2025-05-07T20:25:36.6028688Z libcusparse-12.5.7.5 | 164.9 MB | ########9 | 89%  2025-05-07T20:25:36.6257346Z libcublas-12.8.3.14 | 460.2 MB | ##1 | 21% 2025-05-07T20:25:36.6257875Z 2025-05-07T20:25:36.6257883Z 2025-05-07T20:25:36.6257888Z 2025-05-07T20:25:36.6420761Z libcusolver-11.7.2.5 | 156.9 MB | ########2 | 83%  2025-05-07T20:25:36.6423103Z 2025-05-07T20:25:36.6641396Z nsight-compute-2025. | 320.6 MB | ####2 | 42%  2025-05-07T20:25:36.6641732Z 2025-05-07T20:25:36.6641736Z 2025-05-07T20:25:36.7028665Z libcusparse-12.5.7.5 | 164.9 MB | #########1 | 92%  2025-05-07T20:25:36.7270546Z libcublas-12.8.3.14 | 460.2 MB | ##2 | 22% 2025-05-07T20:25:36.7270927Z 2025-05-07T20:25:36.7270933Z 2025-05-07T20:25:36.7270939Z 2025-05-07T20:25:36.7641606Z libcusolver-11.7.2.5 | 156.9 MB | ########5 | 85%  2025-05-07T20:25:36.7642010Z 2025-05-07T20:25:36.7642016Z 2025-05-07T20:25:36.7851432Z libcusparse-12.5.7.5 | 164.9 MB | #########4 | 94%  2025-05-07T20:25:36.7852260Z 2025-05-07T20:25:36.8271511Z nsight-compute-2025. 
| 320.6 MB | ####3 | 43%  2025-05-07T20:25:36.8271914Z 2025-05-07T20:25:36.8271944Z 2025-05-07T20:25:36.8272640Z 2025-05-07T20:25:36.8503745Z libcusolver-11.7.2.5 | 156.9 MB | ########8 | 88%  2025-05-07T20:25:36.8649010Z libcublas-12.8.3.14 | 460.2 MB | ##2 | 23% 2025-05-07T20:25:36.8649398Z 2025-05-07T20:25:36.8649404Z 2025-05-07T20:25:36.9040440Z libcusparse-12.5.7.5 | 164.9 MB | #########7 | 97%  2025-05-07T20:25:36.9040743Z 2025-05-07T20:25:36.9278807Z nsight-compute-2025. | 320.6 MB | ####4 | 44%  2025-05-07T20:25:36.9279108Z 2025-05-07T20:25:36.9279112Z 2025-05-07T20:25:36.9279537Z 2025-05-07T20:25:36.9504732Z libcusolver-11.7.2.5 | 156.9 MB | #########1 | 91%  2025-05-07T20:25:36.9784942Z libcublas-12.8.3.14 | 460.2 MB | ##3 | 24% 2025-05-07T20:25:36.9785329Z 2025-05-07T20:25:36.9785334Z 2025-05-07T20:25:37.0110058Z libcusparse-12.5.7.5 | 164.9 MB | #########9 | 100%  2025-05-07T20:25:37.0112371Z 2025-05-07T20:25:37.0301714Z nsight-compute-2025. | 320.6 MB | ####5 | 45%  2025-05-07T20:25:37.0302201Z 2025-05-07T20:25:37.0302206Z 2025-05-07T20:25:37.0302959Z 2025-05-07T20:25:37.0506623Z libcusolver-11.7.2.5 | 156.9 MB | #########4 | 94%  2025-05-07T20:25:37.1113351Z libcublas-12.8.3.14 | 460.2 MB | ##4 | 25% 2025-05-07T20:25:37.1114364Z 2025-05-07T20:25:37.1365872Z nsight-compute-2025. | 320.6 MB | ####6 | 46%  2025-05-07T20:25:37.1366222Z 2025-05-07T20:25:37.1366227Z 2025-05-07T20:25:37.1366231Z 2025-05-07T20:25:37.1508736Z libcusolver-11.7.2.5 | 156.9 MB | #########7 | 97%  2025-05-07T20:25:37.2113208Z libcublas-12.8.3.14 | 460.2 MB | ##5 | 25% 2025-05-07T20:25:37.2115094Z 2025-05-07T20:25:37.2484877Z nsight-compute-2025. | 320.6 MB | ####7 | 48%  2025-05-07T20:25:37.2485174Z 2025-05-07T20:25:37.2485179Z 2025-05-07T20:25:37.2489894Z 2025-05-07T20:25:37.2509211Z libcusolver-11.7.2.5 | 156.9 MB | #########9 | 100%  2025-05-07T20:25:37.3117172Z libcublas-12.8.3.14 | 460.2 MB | ##6 | 26% 2025-05-07T20:25:37.3117507Z 2025-05-07T20:25:37.3512293Z nsight-compute-2025. | 320.6 MB | ####8 | 49%  2025-05-07T20:25:37.4216502Z libcublas-12.8.3.14 | 460.2 MB | ##7 | 27% 2025-05-07T20:25:37.4218339Z 2025-05-07T20:25:37.4514256Z nsight-compute-2025. | 320.6 MB | ##### | 50%  2025-05-07T20:25:37.5216567Z libcublas-12.8.3.14 | 460.2 MB | ##8 | 28% 2025-05-07T20:25:37.5219264Z 2025-05-07T20:25:37.5514843Z nsight-compute-2025. | 320.6 MB | #####1 | 51%  2025-05-07T20:25:37.6217037Z libcublas-12.8.3.14 | 460.2 MB | ##9 | 29% 2025-05-07T20:25:37.6219435Z 2025-05-07T20:25:37.6519606Z nsight-compute-2025. | 320.6 MB | #####2 | 53%  2025-05-07T20:25:37.7219052Z libcublas-12.8.3.14 | 460.2 MB | ### | 30% 2025-05-07T20:25:37.7222181Z 2025-05-07T20:25:37.7522319Z nsight-compute-2025. | 320.6 MB | #####4 | 54%  2025-05-07T20:25:37.8222367Z libcublas-12.8.3.14 | 460.2 MB | ###1 | 31% 2025-05-07T20:25:37.8222637Z 2025-05-07T20:25:37.8522832Z nsight-compute-2025. | 320.6 MB | #####5 | 56%  2025-05-07T20:25:37.9225420Z libcublas-12.8.3.14 | 460.2 MB | ###2 | 32% 2025-05-07T20:25:37.9226351Z 2025-05-07T20:25:37.9810677Z nsight-compute-2025. | 320.6 MB | #####7 | 57%  2025-05-07T20:25:38.0226780Z libcublas-12.8.3.14 | 460.2 MB | ###3 | 33% 2025-05-07T20:25:38.0228950Z 2025-05-07T20:25:38.0814260Z nsight-compute-2025. | 320.6 MB | #####9 | 59%  2025-05-07T20:25:38.1228304Z libcublas-12.8.3.14 | 460.2 MB | ###4 | 34% 2025-05-07T20:25:38.1233187Z 2025-05-07T20:25:38.1983813Z nsight-compute-2025. 
| 320.6 MB | ###### | 61%  2025-05-07T20:25:38.2266778Z libcublas-12.8.3.14 | 460.2 MB | ###5 | 35% 2025-05-07T20:25:38.2269400Z 2025-05-07T20:25:38.3268428Z nsight-compute-2025. | 320.6 MB | ######2 | 63%  2025-05-07T20:25:38.3268928Z 2025-05-07T20:25:38.3611723Z nsight-compute-2025. | 320.6 MB | ######4 | 64%  2025-05-07T20:25:38.4273357Z libcublas-12.8.3.14 | 460.2 MB | ###6 | 36% 2025-05-07T20:25:38.4273914Z 2025-05-07T20:25:38.5275696Z nsight-compute-2025. | 320.6 MB | ######6 | 66%  2025-05-07T20:25:38.5279599Z 2025-05-07T20:25:38.5314948Z nsight-compute-2025. | 320.6 MB | ######8 | 68%  2025-05-07T20:25:38.6339676Z libcublas-12.8.3.14 | 460.2 MB | ###6 | 37% 2025-05-07T20:25:38.6342351Z 2025-05-07T20:25:38.6382670Z nsight-compute-2025. | 320.6 MB | ####### | 70%  2025-05-07T20:25:38.7385253Z libcublas-12.8.3.14 | 460.2 MB | ###7 | 38% 2025-05-07T20:25:38.7722575Z libcublas-12.8.3.14 | 460.2 MB | ###8 | 39% 2025-05-07T20:25:38.7725620Z 2025-05-07T20:25:38.8385875Z nsight-compute-2025. | 320.6 MB | #######1 | 72%  2025-05-07T20:25:38.9028313Z libcublas-12.8.3.14 | 460.2 MB | ###9 | 39% 2025-05-07T20:25:38.9033966Z 2025-05-07T20:25:38.9391942Z nsight-compute-2025. | 320.6 MB | #######3 | 74%  2025-05-07T20:25:39.0194237Z libcublas-12.8.3.14 | 460.2 MB | #### | 40% 2025-05-07T20:25:39.0196719Z 2025-05-07T20:25:39.0393280Z nsight-compute-2025. | 320.6 MB | #######5 | 75%  2025-05-07T20:25:39.1273858Z libcublas-12.8.3.14 | 460.2 MB | ####1 | 41% 2025-05-07T20:25:39.1274259Z 2025-05-07T20:25:39.1396315Z nsight-compute-2025. | 320.6 MB | #######6 | 77%  2025-05-07T20:25:39.2276120Z libcublas-12.8.3.14 | 460.2 MB | ####2 | 42% 2025-05-07T20:25:39.2277246Z 2025-05-07T20:25:39.2486352Z nsight-compute-2025. | 320.6 MB | #######8 | 78%  2025-05-07T20:25:39.3462994Z libcublas-12.8.3.14 | 460.2 MB | ####2 | 43% 2025-05-07T20:25:39.3463679Z 2025-05-07T20:25:39.3490732Z nsight-compute-2025. | 320.6 MB | #######9 | 80%  2025-05-07T20:25:39.4493777Z libcublas-12.8.3.14 | 460.2 MB | ####3 | 44% 2025-05-07T20:25:39.4582700Z libcublas-12.8.3.14 | 460.2 MB | ####4 | 45% 2025-05-07T20:25:39.4584812Z 2025-05-07T20:25:39.5516235Z nsight-compute-2025. | 320.6 MB | ######## | 81%  2025-05-07T20:25:39.5664804Z libcublas-12.8.3.14 | 460.2 MB | ####5 | 46% 2025-05-07T20:25:39.5665657Z 2025-05-07T20:25:39.6525601Z nsight-compute-2025. | 320.6 MB | ########2 | 82%  2025-05-07T20:25:39.6712413Z libcublas-12.8.3.14 | 460.2 MB | ####6 | 46% 2025-05-07T20:25:39.6713087Z 2025-05-07T20:25:39.6713190Z 2025-05-07T20:25:39.6713198Z 2025-05-07T20:25:39.6718019Z 2025-05-07T20:25:39.6785113Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%  2025-05-07T20:25:39.6785826Z 2025-05-07T20:25:39.7328130Z nsight-compute-2025. | 320.6 MB | ########3 | 84%  2025-05-07T20:25:39.7328567Z 2025-05-07T20:25:39.7328574Z 2025-05-07T20:25:39.7328580Z 2025-05-07T20:25:39.7328587Z 2025-05-07T20:25:39.7333267Z 2025-05-07T20:25:39.7579167Z libnpp-12.3.3.65 | 130.6 MB | | 0%  2025-05-07T20:25:39.8003899Z libcublas-12.8.3.14 | 460.2 MB | ####7 | 47% 2025-05-07T20:25:39.8008392Z 2025-05-07T20:25:39.8331502Z nsight-compute-2025. | 320.6 MB | ########4 | 85%  2025-05-07T20:25:39.8331812Z 2025-05-07T20:25:39.8331816Z 2025-05-07T20:25:39.8331820Z 2025-05-07T20:25:39.8331824Z 2025-05-07T20:25:39.8331828Z 2025-05-07T20:25:39.8711545Z libnpp-12.3.3.65 | 130.6 MB | 2 | 3%  2025-05-07T20:25:39.9225928Z libcublas-12.8.3.14 | 460.2 MB | ####8 | 48% 2025-05-07T20:25:39.9227568Z 2025-05-07T20:25:39.9359581Z nsight-compute-2025. 
| 320.6 MB | ########6 | 86%  2025-05-07T20:25:39.9359958Z 2025-05-07T20:25:39.9359965Z 2025-05-07T20:25:39.9359971Z 2025-05-07T20:25:39.9359977Z 2025-05-07T20:25:39.9359983Z 2025-05-07T20:25:39.9773523Z libnpp-12.3.3.65 | 130.6 MB | 5 | 5%  2025-05-07T20:25:40.0340226Z libcublas-12.8.3.14 | 460.2 MB | ####8 | 49% 2025-05-07T20:25:40.0340627Z 2025-05-07T20:25:40.0340659Z 2025-05-07T20:25:40.0340663Z 2025-05-07T20:25:40.0340667Z 2025-05-07T20:25:40.0340671Z 2025-05-07T20:25:40.0393942Z libnpp-12.3.3.65 | 130.6 MB | 7 | 8%  2025-05-07T20:25:40.0398363Z 2025-05-07T20:25:40.0890930Z nsight-compute-2025. | 320.6 MB | ########7 | 87%  2025-05-07T20:25:40.1341055Z libcublas-12.8.3.14 | 460.2 MB | ####9 | 50% 2025-05-07T20:25:40.1341373Z 2025-05-07T20:25:40.1341381Z 2025-05-07T20:25:40.1341620Z 2025-05-07T20:25:40.1341628Z 2025-05-07T20:25:40.1343289Z 2025-05-07T20:25:40.1394053Z libnpp-12.3.3.65 | 130.6 MB | # | 10%  2025-05-07T20:25:40.1395933Z 2025-05-07T20:25:40.1513243Z nsight-compute-2025. | 320.6 MB | ########8 | 88%  2025-05-07T20:25:40.1513771Z 2025-05-07T20:25:40.1515503Z 2025-05-07T20:25:40.1853374Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%  2025-05-07T20:25:40.1853790Z 2025-05-07T20:25:40.1853796Z 2025-05-07T20:25:40.1853802Z 2025-05-07T20:25:40.1854161Z 2025-05-07T20:25:40.1854167Z 2025-05-07T20:25:40.1857120Z 2025-05-07T20:25:40.2209537Z cuda-nsight-12.8.55 | 113.2 MB | | 0%  2025-05-07T20:25:40.2447216Z libcublas-12.8.3.14 | 460.2 MB | ##### | 50% 2025-05-07T20:25:40.2447617Z 2025-05-07T20:25:40.2447623Z 2025-05-07T20:25:40.2447634Z 2025-05-07T20:25:40.2447640Z 2025-05-07T20:25:40.2447645Z 2025-05-07T20:25:40.2688920Z libnpp-12.3.3.65 | 130.6 MB | #2 | 12%  2025-05-07T20:25:40.2692021Z 2025-05-07T20:25:40.2859292Z nsight-compute-2025. | 320.6 MB | ########9 | 90%  2025-05-07T20:25:40.2859708Z 2025-05-07T20:25:40.2859714Z 2025-05-07T20:25:40.2859719Z 2025-05-07T20:25:40.2859725Z 2025-05-07T20:25:40.2859730Z 2025-05-07T20:25:40.2862831Z 2025-05-07T20:25:40.3342629Z cuda-nsight-12.8.55 | 113.2 MB | 2 | 2%  2025-05-07T20:25:40.3559678Z libcublas-12.8.3.14 | 460.2 MB | #####1 | 51% 2025-05-07T20:25:40.3560090Z 2025-05-07T20:25:40.3560118Z 2025-05-07T20:25:40.3560124Z 2025-05-07T20:25:40.3560130Z 2025-05-07T20:25:40.3560135Z 2025-05-07T20:25:40.3869818Z libnpp-12.3.3.65 | 130.6 MB | #4 | 15%  2025-05-07T20:25:40.3870249Z 2025-05-07T20:25:40.3870255Z 2025-05-07T20:25:40.3870260Z 2025-05-07T20:25:40.3870266Z 2025-05-07T20:25:40.3870271Z 2025-05-07T20:25:40.3870281Z 2025-05-07T20:25:40.3953846Z cuda-nsight-12.8.55 | 113.2 MB | 5 | 5%  2025-05-07T20:25:40.3954298Z 2025-05-07T20:25:40.4466721Z nsight-compute-2025. | 320.6 MB | ######### | 91%  2025-05-07T20:25:40.4682335Z libcublas-12.8.3.14 | 460.2 MB | #####1 | 52% 2025-05-07T20:25:40.4682731Z 2025-05-07T20:25:40.4682737Z 2025-05-07T20:25:40.4682826Z 2025-05-07T20:25:40.4682834Z 2025-05-07T20:25:40.4687112Z 2025-05-07T20:25:40.4871492Z libnpp-12.3.3.65 | 130.6 MB | #6 | 17%  2025-05-07T20:25:40.4871913Z 2025-05-07T20:25:40.4871919Z 2025-05-07T20:25:40.4871946Z 2025-05-07T20:25:40.4871951Z 2025-05-07T20:25:40.4871957Z 2025-05-07T20:25:40.4877289Z 2025-05-07T20:25:40.5234260Z cuda-nsight-12.8.55 | 113.2 MB | 7 | 7%  2025-05-07T20:25:40.5234714Z 2025-05-07T20:25:40.5615956Z nsight-compute-2025. 
| 320.6 MB | #########1 | 92%  2025-05-07T20:25:40.5776477Z libcublas-12.8.3.14 | 460.2 MB | #####2 | 52% 2025-05-07T20:25:40.5776895Z 2025-05-07T20:25:40.5776901Z 2025-05-07T20:25:40.5776907Z 2025-05-07T20:25:40.5776913Z 2025-05-07T20:25:40.5776919Z 2025-05-07T20:25:40.5879711Z libnpp-12.3.3.65 | 130.6 MB | #8 | 19%  2025-05-07T20:25:40.5880131Z 2025-05-07T20:25:40.5880137Z 2025-05-07T20:25:40.5880143Z 2025-05-07T20:25:40.5880148Z 2025-05-07T20:25:40.5880154Z 2025-05-07T20:25:40.5882784Z 2025-05-07T20:25:40.6355921Z cuda-nsight-12.8.55 | 113.2 MB | # | 10%  2025-05-07T20:25:40.6356375Z 2025-05-07T20:25:40.6757724Z nsight-compute-2025. | 320.6 MB | #########2 | 92%  2025-05-07T20:25:40.6780402Z libcublas-12.8.3.14 | 460.2 MB | #####3 | 53% 2025-05-07T20:25:40.6781188Z 2025-05-07T20:25:40.6781251Z 2025-05-07T20:25:40.6781273Z 2025-05-07T20:25:40.6781279Z 2025-05-07T20:25:40.6781285Z 2025-05-07T20:25:40.6882342Z libnpp-12.3.3.65 | 130.6 MB | ##1 | 21%  2025-05-07T20:25:40.6882781Z 2025-05-07T20:25:40.6882787Z 2025-05-07T20:25:40.6882792Z 2025-05-07T20:25:40.6882798Z 2025-05-07T20:25:40.6882804Z 2025-05-07T20:25:40.6883901Z 2025-05-07T20:25:40.7489228Z cuda-nsight-12.8.55 | 113.2 MB | #2 | 13%  2025-05-07T20:25:40.7490822Z 2025-05-07T20:25:40.7761519Z nsight-compute-2025. | 320.6 MB | #########3 | 93%  2025-05-07T20:25:40.7781858Z libcublas-12.8.3.14 | 460.2 MB | #####3 | 54% 2025-05-07T20:25:40.7782251Z 2025-05-07T20:25:40.7782350Z 2025-05-07T20:25:40.7782382Z 2025-05-07T20:25:40.7782388Z 2025-05-07T20:25:40.7782424Z 2025-05-07T20:25:40.7882589Z libnpp-12.3.3.65 | 130.6 MB | ##3 | 23%  2025-05-07T20:25:40.7883278Z 2025-05-07T20:25:40.7883283Z 2025-05-07T20:25:40.7883289Z 2025-05-07T20:25:40.7883294Z 2025-05-07T20:25:40.7883478Z 2025-05-07T20:25:40.7884285Z 2025-05-07T20:25:40.8609030Z cuda-nsight-12.8.55 | 113.2 MB | #5 | 15%  2025-05-07T20:25:40.8610253Z 2025-05-07T20:25:40.8818064Z nsight-compute-2025. 
| 320.6 MB | #########4 | 94%  2025-05-07T20:25:40.8818496Z 2025-05-07T20:25:40.8818502Z 2025-05-07T20:25:40.8818508Z 2025-05-07T20:25:40.8818513Z 2025-05-07T20:25:40.8818523Z 2025-05-07T20:25:40.8887370Z libnpp-12.3.3.65 | 130.6 MB | ##5 | 26%  2025-05-07T20:25:40.8887809Z 2025-05-07T20:25:40.8887815Z 2025-05-07T20:25:40.8887820Z 2025-05-07T20:25:40.8887826Z 2025-05-07T20:25:40.8887832Z 2025-05-07T20:25:40.8891043Z 2025-05-07T20:25:40.8911344Z cuda-nsight-12.8.55 | 113.2 MB | #8 | 18%  2025-05-07T20:25:40.9166563Z libcublas-12.8.3.14 | 460.2 MB | #####4 | 54% 2025-05-07T20:25:40.9166987Z 2025-05-07T20:25:40.9166994Z 2025-05-07T20:25:40.9175359Z 2025-05-07T20:25:40.9658487Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%  2025-05-07T20:25:40.9658924Z 2025-05-07T20:25:40.9658930Z 2025-05-07T20:25:40.9658936Z 2025-05-07T20:25:40.9658947Z 2025-05-07T20:25:40.9658952Z 2025-05-07T20:25:40.9658958Z 2025-05-07T20:25:40.9660360Z 2025-05-07T20:25:40.9951576Z cuda-nvvp-12.8.57 | 112.4 MB | | 0%  2025-05-07T20:25:40.9952543Z libcublas-12.8.3.14 | 460.2 MB | #####4 | 55% 2025-05-07T20:25:40.9952905Z 2025-05-07T20:25:40.9952912Z 2025-05-07T20:25:40.9952917Z 2025-05-07T20:25:40.9952934Z 2025-05-07T20:25:40.9953062Z 2025-05-07T20:25:41.0042523Z libnpp-12.3.3.65 | 130.6 MB | ##7 | 28%  2025-05-07T20:25:41.0042930Z 2025-05-07T20:25:41.0042944Z 2025-05-07T20:25:41.0042958Z 2025-05-07T20:25:41.0042964Z 2025-05-07T20:25:41.0042969Z 2025-05-07T20:25:41.0042995Z 2025-05-07T20:25:41.0664907Z cuda-nsight-12.8.55 | 113.2 MB | ## | 21%  2025-05-07T20:25:41.0665365Z 2025-05-07T20:25:41.0665371Z 2025-05-07T20:25:41.0665394Z 2025-05-07T20:25:41.0665401Z 2025-05-07T20:25:41.0665406Z 2025-05-07T20:25:41.0665412Z 2025-05-07T20:25:41.0668977Z 2025-05-07T20:25:41.0683634Z cuda-nvvp-12.8.57 | 112.4 MB | 1 | 2%  2025-05-07T20:25:41.0684073Z 2025-05-07T20:25:41.1149498Z nsight-compute-2025. | 320.6 MB | #########5 | 95%  2025-05-07T20:25:41.1249826Z libcublas-12.8.3.14 | 460.2 MB | #####5 | 55% 2025-05-07T20:25:41.1250224Z 2025-05-07T20:25:41.1250230Z 2025-05-07T20:25:41.1250236Z 2025-05-07T20:25:41.1250251Z 2025-05-07T20:25:41.1251887Z 2025-05-07T20:25:41.1354384Z libnpp-12.3.3.65 | 130.6 MB | ##9 | 30%  2025-05-07T20:25:41.1354815Z 2025-05-07T20:25:41.1354821Z 2025-05-07T20:25:41.1354835Z 2025-05-07T20:25:41.1354842Z 2025-05-07T20:25:41.1354874Z 2025-05-07T20:25:41.1354879Z 2025-05-07T20:25:41.1670153Z cuda-nsight-12.8.55 | 113.2 MB | ##3 | 23%  2025-05-07T20:25:41.1670595Z 2025-05-07T20:25:41.1670627Z 2025-05-07T20:25:41.1670633Z 2025-05-07T20:25:41.1670639Z 2025-05-07T20:25:41.1670644Z 2025-05-07T20:25:41.1670649Z 2025-05-07T20:25:41.1672640Z 2025-05-07T20:25:41.1696910Z cuda-nvvp-12.8.57 | 112.4 MB | 3 | 4%  2025-05-07T20:25:41.1697348Z 2025-05-07T20:25:41.2466817Z nsight-compute-2025. 
| 320.6 MB | #########5 | 96%  2025-05-07T20:25:41.2467222Z 2025-05-07T20:25:41.2467228Z 2025-05-07T20:25:41.2467233Z 2025-05-07T20:25:41.2467248Z 2025-05-07T20:25:41.2467254Z 2025-05-07T20:25:41.2467264Z 2025-05-07T20:25:41.2504161Z cuda-nsight-12.8.55 | 113.2 MB | ##5 | 25%  2025-05-07T20:25:41.2526713Z libcublas-12.8.3.14 | 460.2 MB | #####6 | 56% 2025-05-07T20:25:41.2527122Z 2025-05-07T20:25:41.2527129Z 2025-05-07T20:25:41.2527135Z 2025-05-07T20:25:41.2527721Z 2025-05-07T20:25:41.2527727Z 2025-05-07T20:25:41.2682589Z libnpp-12.3.3.65 | 130.6 MB | ###1 | 32%  2025-05-07T20:25:41.2683013Z 2025-05-07T20:25:41.2683261Z 2025-05-07T20:25:41.2683268Z 2025-05-07T20:25:41.2683274Z 2025-05-07T20:25:41.2683279Z 2025-05-07T20:25:41.2683285Z 2025-05-07T20:25:41.2684340Z 2025-05-07T20:25:41.2698383Z cuda-nvvp-12.8.57 | 112.4 MB | 5 | 6%  2025-05-07T20:25:41.2700598Z 2025-05-07T20:25:41.3524812Z nsight-compute-2025. | 320.6 MB | #########6 | 96%  2025-05-07T20:25:41.3525466Z 2025-05-07T20:25:41.3525472Z 2025-05-07T20:25:41.3525478Z 2025-05-07T20:25:41.3525483Z 2025-05-07T20:25:41.3525489Z 2025-05-07T20:25:41.3528577Z 2025-05-07T20:25:41.3547063Z cuda-nsight-12.8.55 | 113.2 MB | ##7 | 28%  2025-05-07T20:25:41.3682747Z libcublas-12.8.3.14 | 460.2 MB | #####6 | 57% 2025-05-07T20:25:41.3683156Z 2025-05-07T20:25:41.3683163Z 2025-05-07T20:25:41.3683191Z 2025-05-07T20:25:41.3683197Z 2025-05-07T20:25:41.3683202Z 2025-05-07T20:25:41.3683208Z 2025-05-07T20:25:41.3685311Z 2025-05-07T20:25:41.3745139Z cuda-nvvp-12.8.57 | 112.4 MB | 7 | 7%  2025-05-07T20:25:41.3745586Z 2025-05-07T20:25:41.3745593Z 2025-05-07T20:25:41.3745598Z 2025-05-07T20:25:41.3745604Z 2025-05-07T20:25:41.3745609Z 2025-05-07T20:25:41.3852979Z libnpp-12.3.3.65 | 130.6 MB | ###3 | 33%  2025-05-07T20:25:41.3853396Z 2025-05-07T20:25:41.4580721Z nsight-compute-2025. | 320.6 MB | #########7 | 97%  2025-05-07T20:25:41.4581368Z 2025-05-07T20:25:41.4581374Z 2025-05-07T20:25:41.4581380Z 2025-05-07T20:25:41.4581386Z 2025-05-07T20:25:41.4581391Z 2025-05-07T20:25:41.4585405Z 2025-05-07T20:25:41.4719125Z cuda-nsight-12.8.55 | 113.2 MB | ##9 | 30%  2025-05-07T20:25:41.4719578Z 2025-05-07T20:25:41.4719584Z 2025-05-07T20:25:41.4719590Z 2025-05-07T20:25:41.4719595Z 2025-05-07T20:25:41.4719624Z 2025-05-07T20:25:41.4719630Z 2025-05-07T20:25:41.4719636Z 2025-05-07T20:25:41.4782185Z cuda-nvvp-12.8.57 | 112.4 MB | 8 | 9%  2025-05-07T20:25:41.4848192Z libcublas-12.8.3.14 | 460.2 MB | #####7 | 57% 2025-05-07T20:25:41.4848598Z 2025-05-07T20:25:41.4848715Z 2025-05-07T20:25:41.4848723Z 2025-05-07T20:25:41.4848730Z 2025-05-07T20:25:41.4848758Z 2025-05-07T20:25:41.4991429Z libnpp-12.3.3.65 | 130.6 MB | ###5 | 35%  2025-05-07T20:25:41.4993865Z 2025-05-07T20:25:41.5663569Z nsight-compute-2025. 
| 320.6 MB | #########7 | 98%  2025-05-07T20:25:41.5663975Z 2025-05-07T20:25:41.5663982Z 2025-05-07T20:25:41.5663999Z 2025-05-07T20:25:41.5664005Z 2025-05-07T20:25:41.5664011Z 2025-05-07T20:25:41.5665406Z 2025-05-07T20:25:41.5727224Z cuda-nsight-12.8.55 | 113.2 MB | ###1 | 32%  2025-05-07T20:25:41.5728284Z 2025-05-07T20:25:41.5728290Z 2025-05-07T20:25:41.5728296Z 2025-05-07T20:25:41.5728301Z 2025-05-07T20:25:41.5728327Z 2025-05-07T20:25:41.5728333Z 2025-05-07T20:25:41.5730639Z 2025-05-07T20:25:41.5807486Z cuda-nvvp-12.8.57 | 112.4 MB | # | 11%  2025-05-07T20:25:41.5870419Z libcublas-12.8.3.14 | 460.2 MB | #####7 | 58% 2025-05-07T20:25:41.5870809Z 2025-05-07T20:25:41.5870815Z 2025-05-07T20:25:41.5870825Z 2025-05-07T20:25:41.5870841Z 2025-05-07T20:25:41.5870847Z 2025-05-07T20:25:41.6052626Z libnpp-12.3.3.65 | 130.6 MB | ###6 | 37%  2025-05-07T20:25:41.6053040Z 2025-05-07T20:25:41.6701988Z nsight-compute-2025. | 320.6 MB | #########8 | 98%  2025-05-07T20:25:41.6702413Z 2025-05-07T20:25:41.6702419Z 2025-05-07T20:25:41.6702425Z 2025-05-07T20:25:41.6702430Z 2025-05-07T20:25:41.6702436Z 2025-05-07T20:25:41.6707776Z 2025-05-07T20:25:41.6736198Z cuda-nsight-12.8.55 | 113.2 MB | ###3 | 34%  2025-05-07T20:25:41.6737117Z 2025-05-07T20:25:41.6737123Z 2025-05-07T20:25:41.6737129Z 2025-05-07T20:25:41.6737863Z 2025-05-07T20:25:41.6737867Z 2025-05-07T20:25:41.6737882Z 2025-05-07T20:25:41.6737886Z 2025-05-07T20:25:41.6808362Z cuda-nvvp-12.8.57 | 112.4 MB | #2 | 12%  2025-05-07T20:25:41.6965192Z libcublas-12.8.3.14 | 460.2 MB | #####7 | 58% 2025-05-07T20:25:41.6965851Z 2025-05-07T20:25:41.6965857Z 2025-05-07T20:25:41.6965863Z 2025-05-07T20:25:41.6965869Z 2025-05-07T20:25:41.6976609Z 2025-05-07T20:25:41.7059098Z libnpp-12.3.3.65 | 130.6 MB | ###8 | 38%  2025-05-07T20:25:41.7059514Z 2025-05-07T20:25:41.7702572Z nsight-compute-2025. | 320.6 MB | #########9 | 99%  2025-05-07T20:25:41.7703004Z 2025-05-07T20:25:41.7703011Z 2025-05-07T20:25:41.7703017Z 2025-05-07T20:25:41.7703024Z 2025-05-07T20:25:41.7703030Z 2025-05-07T20:25:41.7706972Z 2025-05-07T20:25:41.7745255Z cuda-nsight-12.8.55 | 113.2 MB | ###5 | 36%  2025-05-07T20:25:41.7745708Z 2025-05-07T20:25:41.7745715Z 2025-05-07T20:25:41.7745740Z 2025-05-07T20:25:41.7745745Z 2025-05-07T20:25:41.7745751Z 2025-05-07T20:25:41.7745756Z 2025-05-07T20:25:41.7752213Z 2025-05-07T20:25:41.7999834Z cuda-nvvp-12.8.57 | 112.4 MB | #4 | 14%  2025-05-07T20:25:41.8061164Z libcublas-12.8.3.14 | 460.2 MB | #####8 | 58% 2025-05-07T20:25:41.8065148Z 2025-05-07T20:25:41.8073851Z nsight-compute-2025. 
| 320.6 MB | #########9 | 100%  2025-05-07T20:25:41.8074253Z 2025-05-07T20:25:41.8074260Z 2025-05-07T20:25:41.8074265Z 2025-05-07T20:25:41.8074271Z 2025-05-07T20:25:41.8074276Z 2025-05-07T20:25:41.8704524Z libnpp-12.3.3.65 | 130.6 MB | #### | 40%  2025-05-07T20:25:41.8704949Z 2025-05-07T20:25:41.8704954Z 2025-05-07T20:25:41.8704960Z 2025-05-07T20:25:41.8704966Z 2025-05-07T20:25:41.8704972Z 2025-05-07T20:25:41.8707381Z 2025-05-07T20:25:41.8749528Z cuda-nsight-12.8.55 | 113.2 MB | ###8 | 38%  2025-05-07T20:25:41.8749967Z 2025-05-07T20:25:41.8749973Z 2025-05-07T20:25:41.8750000Z 2025-05-07T20:25:41.8750004Z 2025-05-07T20:25:41.8750008Z 2025-05-07T20:25:41.8750012Z 2025-05-07T20:25:41.8755303Z 2025-05-07T20:25:41.9045238Z cuda-nvvp-12.8.57 | 112.4 MB | #6 | 16%  2025-05-07T20:25:41.9104481Z libcublas-12.8.3.14 | 460.2 MB | #####8 | 59% 2025-05-07T20:25:41.9104867Z 2025-05-07T20:25:41.9104873Z 2025-05-07T20:25:41.9104878Z 2025-05-07T20:25:41.9104884Z 2025-05-07T20:25:41.9107026Z 2025-05-07T20:25:41.9711031Z libnpp-12.3.3.65 | 130.6 MB | ####1 | 42%  2025-05-07T20:25:41.9711467Z 2025-05-07T20:25:41.9711473Z 2025-05-07T20:25:41.9711478Z 2025-05-07T20:25:41.9711483Z 2025-05-07T20:25:41.9711490Z 2025-05-07T20:25:41.9711496Z 2025-05-07T20:25:41.9751963Z cuda-nsight-12.8.55 | 113.2 MB | #### | 40%  2025-05-07T20:25:41.9752409Z 2025-05-07T20:25:41.9752415Z 2025-05-07T20:25:41.9752420Z 2025-05-07T20:25:41.9752425Z 2025-05-07T20:25:41.9752431Z 2025-05-07T20:25:41.9752456Z 2025-05-07T20:25:41.9752461Z 2025-05-07T20:25:42.0055044Z cuda-nvvp-12.8.57 | 112.4 MB | #8 | 18%  2025-05-07T20:25:42.0104871Z libcublas-12.8.3.14 | 460.2 MB | #####9 | 59% 2025-05-07T20:25:42.0105246Z 2025-05-07T20:25:42.0105252Z 2025-05-07T20:25:42.0105257Z 2025-05-07T20:25:42.0105262Z 2025-05-07T20:25:42.0110964Z 2025-05-07T20:25:42.0713133Z libnpp-12.3.3.65 | 130.6 MB | ####3 | 43%  2025-05-07T20:25:42.0713610Z 2025-05-07T20:25:42.0713621Z 2025-05-07T20:25:42.0713625Z 2025-05-07T20:25:42.0713629Z 2025-05-07T20:25:42.0713633Z 2025-05-07T20:25:42.0713637Z 2025-05-07T20:25:42.0752402Z cuda-nsight-12.8.55 | 113.2 MB | ####2 | 43%  2025-05-07T20:25:42.0752870Z 2025-05-07T20:25:42.0753015Z 2025-05-07T20:25:42.0753020Z 2025-05-07T20:25:42.0753024Z 2025-05-07T20:25:42.0753027Z 2025-05-07T20:25:42.0753031Z 2025-05-07T20:25:42.0753035Z 2025-05-07T20:25:42.1056255Z cuda-nvvp-12.8.57 | 112.4 MB | ## | 20%  2025-05-07T20:25:42.1107334Z libcublas-12.8.3.14 | 460.2 MB | #####9 | 60% 2025-05-07T20:25:42.1107601Z 2025-05-07T20:25:42.1107800Z 2025-05-07T20:25:42.1107806Z 2025-05-07T20:25:42.1107810Z 2025-05-07T20:25:42.1115311Z 2025-05-07T20:25:42.1805249Z libnpp-12.3.3.65 | 130.6 MB | ####5 | 45%  2025-05-07T20:25:42.1805564Z 2025-05-07T20:25:42.1805568Z 2025-05-07T20:25:42.1805572Z 2025-05-07T20:25:42.1805576Z 2025-05-07T20:25:42.1805580Z 2025-05-07T20:25:42.1807332Z 2025-05-07T20:25:42.1816141Z cuda-nsight-12.8.55 | 113.2 MB | ####5 | 45%  2025-05-07T20:25:42.1816574Z 2025-05-07T20:25:42.1816578Z 2025-05-07T20:25:42.1816582Z 2025-05-07T20:25:42.1816586Z 2025-05-07T20:25:42.1816590Z 2025-05-07T20:25:42.1816594Z 2025-05-07T20:25:42.1821117Z 2025-05-07T20:25:42.2070320Z cuda-nvvp-12.8.57 | 112.4 MB | ##2 | 22%  2025-05-07T20:25:42.2110105Z libcublas-12.8.3.14 | 460.2 MB | ###### | 60% 2025-05-07T20:25:42.2110503Z 2025-05-07T20:25:42.2110509Z 2025-05-07T20:25:42.2110516Z 2025-05-07T20:25:42.2110521Z 2025-05-07T20:25:42.2113090Z 2025-05-07T20:25:42.2819498Z libnpp-12.3.3.65 | 130.6 MB | ####7 | 47%  2025-05-07T20:25:42.2819923Z 
2025-05-07T20:25:42.2819929Z 2025-05-07T20:25:42.2819934Z 2025-05-07T20:25:42.2819940Z 2025-05-07T20:25:42.2819946Z 2025-05-07T20:25:42.2819951Z 2025-05-07T20:25:42.2822245Z 2025-05-07T20:25:42.2889599Z cuda-nvvp-12.8.57 | 112.4 MB | ##3 | 24%  2025-05-07T20:25:42.2890255Z 2025-05-07T20:25:42.2890259Z 2025-05-07T20:25:42.2890263Z 2025-05-07T20:25:42.2890276Z 2025-05-07T20:25:42.2890280Z 2025-05-07T20:25:42.2898360Z 2025-05-07T20:25:42.3070999Z cuda-nsight-12.8.55 | 113.2 MB | ####7 | 47%  2025-05-07T20:25:42.3114350Z libcublas-12.8.3.14 | 460.2 MB | ###### | 61% 2025-05-07T20:25:42.3114742Z 2025-05-07T20:25:42.3114749Z 2025-05-07T20:25:42.3114777Z 2025-05-07T20:25:42.3114783Z 2025-05-07T20:25:42.3117842Z 2025-05-07T20:25:42.3822959Z libnpp-12.3.3.65 | 130.6 MB | ####9 | 49%  2025-05-07T20:25:42.3823300Z 2025-05-07T20:25:42.3823305Z 2025-05-07T20:25:42.3823309Z 2025-05-07T20:25:42.3823313Z 2025-05-07T20:25:42.3823317Z 2025-05-07T20:25:42.3823320Z 2025-05-07T20:25:42.3823639Z 2025-05-07T20:25:42.3952885Z cuda-nvvp-12.8.57 | 112.4 MB | ##5 | 26%  2025-05-07T20:25:42.3953203Z 2025-05-07T20:25:42.3953208Z 2025-05-07T20:25:42.3953212Z 2025-05-07T20:25:42.3953216Z 2025-05-07T20:25:42.3953231Z 2025-05-07T20:25:42.3953235Z 2025-05-07T20:25:42.4076577Z cuda-nsight-12.8.55 | 113.2 MB | ####9 | 49%  2025-05-07T20:25:42.4116677Z libcublas-12.8.3.14 | 460.2 MB | ######1 | 61% 2025-05-07T20:25:42.4116945Z 2025-05-07T20:25:42.4116953Z 2025-05-07T20:25:42.4116957Z 2025-05-07T20:25:42.4117147Z 2025-05-07T20:25:42.4119665Z 2025-05-07T20:25:42.4832803Z libnpp-12.3.3.65 | 130.6 MB | ##### | 51%  2025-05-07T20:25:42.4833243Z 2025-05-07T20:25:42.4833250Z 2025-05-07T20:25:42.4833255Z 2025-05-07T20:25:42.4833282Z 2025-05-07T20:25:42.4833288Z 2025-05-07T20:25:42.4833293Z 2025-05-07T20:25:42.4840494Z 2025-05-07T20:25:42.5030841Z cuda-nvvp-12.8.57 | 112.4 MB | ##7 | 28%  2025-05-07T20:25:42.5031211Z 2025-05-07T20:25:42.5031215Z 2025-05-07T20:25:42.5031219Z 2025-05-07T20:25:42.5031223Z 2025-05-07T20:25:42.5031227Z 2025-05-07T20:25:42.5031231Z 2025-05-07T20:25:42.5077401Z cuda-nsight-12.8.55 | 113.2 MB | #####1 | 51%  2025-05-07T20:25:42.5155064Z libcublas-12.8.3.14 | 460.2 MB | ######2 | 62% 2025-05-07T20:25:42.5155332Z 2025-05-07T20:25:42.5155336Z 2025-05-07T20:25:42.5155340Z 2025-05-07T20:25:42.5155344Z 2025-05-07T20:25:42.5161401Z 2025-05-07T20:25:42.5928655Z libnpp-12.3.3.65 | 130.6 MB | #####2 | 53%  2025-05-07T20:25:42.5929255Z 2025-05-07T20:25:42.5929260Z 2025-05-07T20:25:42.5929263Z 2025-05-07T20:25:42.5929267Z 2025-05-07T20:25:42.5929271Z 2025-05-07T20:25:42.5929275Z 2025-05-07T20:25:42.5929412Z 2025-05-07T20:25:42.6036013Z cuda-nvvp-12.8.57 | 112.4 MB | ##9 | 30%  2025-05-07T20:25:42.6036312Z 2025-05-07T20:25:42.6036316Z 2025-05-07T20:25:42.6036320Z 2025-05-07T20:25:42.6036324Z 2025-05-07T20:25:42.6036328Z 2025-05-07T20:25:42.6038496Z 2025-05-07T20:25:42.6080165Z cuda-nsight-12.8.55 | 113.2 MB | #####3 | 54%  2025-05-07T20:25:42.6159805Z libcublas-12.8.3.14 | 460.2 MB | ######2 | 63% 2025-05-07T20:25:42.6160189Z 2025-05-07T20:25:42.6160195Z 2025-05-07T20:25:42.6160201Z 2025-05-07T20:25:42.6160206Z 2025-05-07T20:25:42.6160211Z 2025-05-07T20:25:42.6934654Z libnpp-12.3.3.65 | 130.6 MB | #####4 | 55%  2025-05-07T20:25:42.6935048Z 2025-05-07T20:25:42.6935052Z 2025-05-07T20:25:42.6935056Z 2025-05-07T20:25:42.6935077Z 2025-05-07T20:25:42.6935081Z 2025-05-07T20:25:42.6935085Z 2025-05-07T20:25:42.6935089Z 2025-05-07T20:25:42.7038107Z cuda-nvvp-12.8.57 | 112.4 MB | ###2 | 32%  2025-05-07T20:25:42.7038492Z 
2025-05-07T20:25:42.7038498Z 2025-05-07T20:25:42.7038504Z 2025-05-07T20:25:42.7038509Z 2025-05-07T20:25:42.7038514Z 2025-05-07T20:25:42.7039867Z 2025-05-07T20:25:42.7101210Z cuda-nsight-12.8.55 | 113.2 MB | #####5 | 56%  2025-05-07T20:25:42.7188992Z libcublas-12.8.3.14 | 460.2 MB | ######3 | 63% 2025-05-07T20:25:42.7189355Z 2025-05-07T20:25:42.7189488Z 2025-05-07T20:25:42.7189496Z 2025-05-07T20:25:42.7189501Z 2025-05-07T20:25:42.7191148Z 2025-05-07T20:25:42.7938820Z libnpp-12.3.3.65 | 130.6 MB | #####6 | 57%  2025-05-07T20:25:42.7939182Z 2025-05-07T20:25:42.7939186Z 2025-05-07T20:25:42.7939190Z 2025-05-07T20:25:42.7939194Z 2025-05-07T20:25:42.7939197Z 2025-05-07T20:25:42.7939202Z 2025-05-07T20:25:42.7939225Z 2025-05-07T20:25:42.8039584Z cuda-nvvp-12.8.57 | 112.4 MB | ###4 | 34%  2025-05-07T20:25:42.8040003Z 2025-05-07T20:25:42.8040008Z 2025-05-07T20:25:42.8040023Z 2025-05-07T20:25:42.8040027Z 2025-05-07T20:25:42.8040031Z 2025-05-07T20:25:42.8040034Z 2025-05-07T20:25:42.8101587Z cuda-nsight-12.8.55 | 113.2 MB | #####7 | 58%  2025-05-07T20:25:42.8194167Z libcublas-12.8.3.14 | 460.2 MB | ######3 | 64% 2025-05-07T20:25:42.8194511Z 2025-05-07T20:25:42.8194517Z 2025-05-07T20:25:42.8194523Z 2025-05-07T20:25:42.8194528Z 2025-05-07T20:25:42.8194534Z 2025-05-07T20:25:42.9053956Z libnpp-12.3.3.65 | 130.6 MB | #####8 | 59%  2025-05-07T20:25:42.9054332Z 2025-05-07T20:25:42.9054336Z 2025-05-07T20:25:42.9054340Z 2025-05-07T20:25:42.9054344Z 2025-05-07T20:25:42.9054348Z 2025-05-07T20:25:42.9054352Z 2025-05-07T20:25:42.9109872Z cuda-nsight-12.8.55 | 113.2 MB | ###### | 61%  2025-05-07T20:25:42.9197263Z libcublas-12.8.3.14 | 460.2 MB | ######4 | 64% 2025-05-07T20:25:42.9197564Z 2025-05-07T20:25:42.9197761Z 2025-05-07T20:25:42.9197768Z 2025-05-07T20:25:42.9197774Z 2025-05-07T20:25:42.9199478Z 2025-05-07T20:25:43.0058035Z libnpp-12.3.3.65 | 130.6 MB | ###### | 61%  2025-05-07T20:25:43.0058476Z 2025-05-07T20:25:43.0058483Z 2025-05-07T20:25:43.0058488Z 2025-05-07T20:25:43.0058504Z 2025-05-07T20:25:43.0058510Z 2025-05-07T20:25:43.0060197Z 2025-05-07T20:25:43.0115647Z cuda-nsight-12.8.55 | 113.2 MB | ######2 | 63%  2025-05-07T20:25:43.0191009Z libcublas-12.8.3.14 | 460.2 MB | ######4 | 65% 2025-05-07T20:25:43.0191276Z 2025-05-07T20:25:43.0191281Z 2025-05-07T20:25:43.0191294Z 2025-05-07T20:25:43.0191298Z 2025-05-07T20:25:43.0191303Z 2025-05-07T20:25:43.0191306Z 2025-05-07T20:25:43.0191310Z 2025-05-07T20:25:43.0199334Z cuda-nvvp-12.8.57 | 112.4 MB | ###6 | 37%  2025-05-07T20:25:43.0199637Z 2025-05-07T20:25:43.0199969Z 2025-05-07T20:25:43.0199975Z 2025-05-07T20:25:43.0199980Z 2025-05-07T20:25:43.0204227Z 2025-05-07T20:25:43.1059699Z libnpp-12.3.3.65 | 130.6 MB | ######2 | 63%  2025-05-07T20:25:43.1060027Z 2025-05-07T20:25:43.1060031Z 2025-05-07T20:25:43.1060035Z 2025-05-07T20:25:43.1060039Z 2025-05-07T20:25:43.1060043Z 2025-05-07T20:25:43.1060697Z 2025-05-07T20:25:43.1192428Z cuda-nsight-12.8.55 | 113.2 MB | ######5 | 65%  2025-05-07T20:25:43.1192760Z 2025-05-07T20:25:43.1192766Z 2025-05-07T20:25:43.1192771Z 2025-05-07T20:25:43.1192776Z 2025-05-07T20:25:43.1192782Z 2025-05-07T20:25:43.1192787Z 2025-05-07T20:25:43.1192792Z 2025-05-07T20:25:43.1208232Z cuda-nvvp-12.8.57 | 112.4 MB | ###9 | 39%  2025-05-07T20:25:43.1208643Z 2025-05-07T20:25:43.1208650Z 2025-05-07T20:25:43.1208655Z 2025-05-07T20:25:43.1208660Z 2025-05-07T20:25:43.1208666Z 2025-05-07T20:25:43.1272766Z libnpp-12.3.3.65 | 130.6 MB | ######4 | 65%  2025-05-07T20:25:43.2065599Z libcublas-12.8.3.14 | 460.2 MB | ######5 | 65% 2025-05-07T20:25:43.2065909Z 
2025-05-07T20:25:43.2065925Z 2025-05-07T20:25:43.2065951Z 2025-05-07T20:25:43.2065957Z 2025-05-07T20:25:43.2065962Z 2025-05-07T20:25:43.2069354Z 2025-05-07T20:25:43.2193300Z cuda-nsight-12.8.55 | 113.2 MB | ######7 | 68%  2025-05-07T20:25:43.2193834Z 2025-05-07T20:25:43.2193841Z 2025-05-07T20:25:43.2193846Z 2025-05-07T20:25:43.2193851Z 2025-05-07T20:25:43.2193857Z 2025-05-07T20:25:43.2193862Z 2025-05-07T20:25:43.2196352Z 2025-05-07T20:25:43.2210258Z cuda-nvvp-12.8.57 | 112.4 MB | ####1 | 42%  2025-05-07T20:25:43.2210634Z 2025-05-07T20:25:43.2210639Z 2025-05-07T20:25:43.2210645Z 2025-05-07T20:25:43.2210650Z 2025-05-07T20:25:43.2212565Z 2025-05-07T20:25:43.3068249Z libnpp-12.3.3.65 | 130.6 MB | ######7 | 67%  2025-05-07T20:25:43.3068662Z 2025-05-07T20:25:43.3068668Z 2025-05-07T20:25:43.3068691Z 2025-05-07T20:25:43.3068696Z 2025-05-07T20:25:43.3068701Z 2025-05-07T20:25:43.3075355Z 2025-05-07T20:25:43.3196649Z cuda-nsight-12.8.55 | 113.2 MB | ####### | 70%  2025-05-07T20:25:43.3197107Z 2025-05-07T20:25:43.3197113Z 2025-05-07T20:25:43.3197119Z 2025-05-07T20:25:43.3197135Z 2025-05-07T20:25:43.3197140Z 2025-05-07T20:25:43.3197145Z 2025-05-07T20:25:43.3198456Z 2025-05-07T20:25:43.3212647Z cuda-nvvp-12.8.57 | 112.4 MB | ####4 | 45%  2025-05-07T20:25:43.3212962Z 2025-05-07T20:25:43.3212966Z 2025-05-07T20:25:43.3212970Z 2025-05-07T20:25:43.3212974Z 2025-05-07T20:25:43.3212978Z 2025-05-07T20:25:43.3704996Z libnpp-12.3.3.65 | 130.6 MB | ######9 | 69%  2025-05-07T20:25:43.4071145Z libcublas-12.8.3.14 | 460.2 MB | ######5 | 66% 2025-05-07T20:25:43.4071417Z 2025-05-07T20:25:43.4071421Z 2025-05-07T20:25:43.4071425Z 2025-05-07T20:25:43.4071429Z 2025-05-07T20:25:43.4071433Z 2025-05-07T20:25:43.4074396Z 2025-05-07T20:25:43.4200947Z cuda-nsight-12.8.55 | 113.2 MB | #######3 | 74%  2025-05-07T20:25:43.4201349Z 2025-05-07T20:25:43.4201353Z 2025-05-07T20:25:43.4201369Z 2025-05-07T20:25:43.4201374Z 2025-05-07T20:25:43.4201378Z 2025-05-07T20:25:43.4201390Z 2025-05-07T20:25:43.4201394Z 2025-05-07T20:25:43.4245160Z cuda-nvvp-12.8.57 | 112.4 MB | ####7 | 47%  2025-05-07T20:25:43.4245457Z 2025-05-07T20:25:43.4245462Z 2025-05-07T20:25:43.4245466Z 2025-05-07T20:25:43.4245477Z 2025-05-07T20:25:43.4245481Z 2025-05-07T20:25:43.4776649Z libnpp-12.3.3.65 | 130.6 MB | #######1 | 71%  2025-05-07T20:25:43.5071310Z libcublas-12.8.3.14 | 460.2 MB | ######6 | 66% 2025-05-07T20:25:43.5071582Z 2025-05-07T20:25:43.5071586Z 2025-05-07T20:25:43.5071590Z 2025-05-07T20:25:43.5071594Z 2025-05-07T20:25:43.5071599Z 2025-05-07T20:25:43.5073829Z 2025-05-07T20:25:43.5203069Z cuda-nsight-12.8.55 | 113.2 MB | #######6 | 76%  2025-05-07T20:25:43.5203646Z 2025-05-07T20:25:43.5203650Z 2025-05-07T20:25:43.5203654Z 2025-05-07T20:25:43.5203658Z 2025-05-07T20:25:43.5203662Z 2025-05-07T20:25:43.5203816Z 2025-05-07T20:25:43.5204129Z 2025-05-07T20:25:43.5247728Z cuda-nvvp-12.8.57 | 112.4 MB | ####9 | 50%  2025-05-07T20:25:43.5248087Z 2025-05-07T20:25:43.5248092Z 2025-05-07T20:25:43.5248098Z 2025-05-07T20:25:43.5248103Z 2025-05-07T20:25:43.5250232Z 2025-05-07T20:25:43.5948190Z libnpp-12.3.3.65 | 130.6 MB | #######3 | 74%  2025-05-07T20:25:43.6204790Z libcublas-12.8.3.14 | 460.2 MB | ######6 | 67% 2025-05-07T20:25:43.6205178Z 2025-05-07T20:25:43.6205182Z 2025-05-07T20:25:43.6205186Z 2025-05-07T20:25:43.6205190Z 2025-05-07T20:25:43.6205194Z 2025-05-07T20:25:43.6205198Z 2025-05-07T20:25:43.6207089Z 2025-05-07T20:25:43.6284572Z cuda-nvvp-12.8.57 | 112.4 MB | #####2 | 52%  2025-05-07T20:25:43.6284877Z 2025-05-07T20:25:43.6284898Z 2025-05-07T20:25:43.6284902Z 
2025-05-07T20:25:48.8394912Z cuda-nsight-12.8.55 | 113.2 MB | ########## | 100%
2025-05-07T20:25:48.9631890Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%
2025-05-07T20:25:49.9751632Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%
2025-05-07T20:25:49.9777745Z cuda-nvvp-12.8.57 | 112.4 MB | ########## | 100%
2025-05-07T20:25:52.6911392Z gds-tools-1.13.0.11 | 37.9 MB | ########## | 100%
2025-05-07T20:25:53.1145165Z cuda-nvrtc-12.8.61 | 63.1 MB | ########## | 100%
2025-05-07T20:25:53.3527776Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%
2025-05-07T20:25:53.4330352Z libcurand-10.3.9.55 | 43.6 MB | ########## | 100%
2025-05-07T20:25:53.6853133Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%
2025-05-07T20:25:54.0557593Z nsight-compute-2025. | 320.6 MB | ########## | 100%
2025-05-07T20:25:54.8853473Z libnvjitlink-12.8.61 | 28.7 MB | ########## | 100%
2025-05-07T20:25:55.0618083Z cuda-nvcc-tools-12.8 | 24.5 MB | ########## | 100%
2025-05-07T20:25:55.3127542Z python-3.10.13 | 24.5 MB | ########## | 100%
2025-05-07T20:25:55.8022800Z cuda-nvvm-tools-12.8 | 23.5 MB | ########## | 100%
2025-05-07T20:25:56.0327317Z cuda-sanitizer-api-1 | 8.8 MB | ########## | 100%
2025-05-07T20:25:56.0474617Z ... (more hidden) ...
2025-05-07T20:25:56.1333264Z cuda-nvcc-dev_linux- | 12.7 MB | ########## | 100%
2025-05-07T20:25:56.2959601Z cuda-nvdisasm-12.8.5 | 4.9 MB | ########## | 100%
2025-05-07T20:25:58.5953721Z cuda-nvvm-impl-12.8. | 20.8 MB | ########## | 100%
2025-05-07T20:25:59.1447832Z libcublas-12.8.3.14 | 460.2 MB | ########## | 100%
2025-05-07T20:26:07.6899787Z nsight-compute-2025.
| 320.6 MB | ########## | 100%  2025-05-07T20:26:07.6909331Z libcublas-12.8.3.14 | 460.2 MB | ########## | 100% 2025-05-07T20:26:07.6909712Z 2025-05-07T20:26:07.6909718Z 2025-05-07T20:26:07.6909725Z 2025-05-07T20:26:07.6909731Z 2025-05-07T20:26:07.6909738Z 2025-05-07T20:26:07.6909744Z 2025-05-07T20:26:07.6909750Z 2025-05-07T20:26:07.6909756Z 2025-05-07T20:26:07.6909765Z 2025-05-07T20:26:07.6909772Z 2025-05-07T20:26:07.6909778Z 2025-05-07T20:26:07.6909785Z 2025-05-07T20:26:07.6909789Z 2025-05-07T20:26:07.6909801Z 2025-05-07T20:26:07.6909805Z 2025-05-07T20:26:07.6910115Z 2025-05-07T20:26:07.6910119Z 2025-05-07T20:26:07.6910123Z 2025-05-07T20:26:07.6910127Z 2025-05-07T20:26:07.6910234Z 2025-05-07T20:26:07.6910857Z  2025-05-07T20:26:07.6911226Z 2025-05-07T20:26:07.6911453Z 2025-05-07T20:26:07.6911659Z  2025-05-07T20:26:07.6911888Z 2025-05-07T20:26:07.6911892Z 2025-05-07T20:26:07.6912072Z  2025-05-07T20:26:07.6912299Z 2025-05-07T20:26:07.6912303Z 2025-05-07T20:26:07.6912314Z 2025-05-07T20:26:07.6912563Z  2025-05-07T20:26:07.6912885Z 2025-05-07T20:26:07.6912890Z 2025-05-07T20:26:07.6912894Z 2025-05-07T20:26:07.6912898Z 2025-05-07T20:26:07.6913406Z  2025-05-07T20:26:07.6913793Z 2025-05-07T20:26:07.6913810Z 2025-05-07T20:26:07.6913819Z 2025-05-07T20:26:07.6913823Z 2025-05-07T20:26:07.6913828Z 2025-05-07T20:26:07.6914341Z  2025-05-07T20:26:07.6914603Z 2025-05-07T20:26:07.6914619Z 2025-05-07T20:26:07.6914625Z 2025-05-07T20:26:07.6914629Z 2025-05-07T20:26:07.6914633Z 2025-05-07T20:26:07.6914637Z 2025-05-07T20:26:07.6915222Z  2025-05-07T20:26:07.6915510Z 2025-05-07T20:26:07.6915516Z 2025-05-07T20:26:07.6915529Z 2025-05-07T20:26:07.6915535Z 2025-05-07T20:26:07.6915540Z 2025-05-07T20:26:07.6915546Z 2025-05-07T20:26:07.6915552Z 2025-05-07T20:26:07.6916360Z  2025-05-07T20:26:07.6916789Z 2025-05-07T20:26:07.6916797Z 2025-05-07T20:26:07.6916804Z 2025-05-07T20:26:07.6916810Z 2025-05-07T20:26:07.6916816Z 2025-05-07T20:26:07.6916822Z 2025-05-07T20:26:07.6916839Z 2025-05-07T20:26:07.6916866Z 2025-05-07T20:26:07.6917268Z  2025-05-07T20:26:07.6917683Z 2025-05-07T20:26:07.6917698Z 2025-05-07T20:26:07.6917704Z 2025-05-07T20:26:07.6917720Z 2025-05-07T20:26:07.6917726Z 2025-05-07T20:26:07.6917732Z 2025-05-07T20:26:07.6917737Z 2025-05-07T20:26:07.6917743Z 2025-05-07T20:26:07.6917749Z 2025-05-07T20:26:07.6918250Z  2025-05-07T20:26:07.6918599Z 2025-05-07T20:26:07.6918605Z 2025-05-07T20:26:07.6918620Z 2025-05-07T20:26:07.6918626Z 2025-05-07T20:26:07.6918632Z 2025-05-07T20:26:07.6918637Z 2025-05-07T20:26:07.6918643Z 2025-05-07T20:26:07.6918648Z 2025-05-07T20:26:07.6918654Z 2025-05-07T20:26:07.6918664Z 2025-05-07T20:26:07.6919137Z  2025-05-07T20:26:07.6919565Z 2025-05-07T20:26:07.6919571Z 2025-05-07T20:26:07.6919577Z 2025-05-07T20:26:07.6919595Z 2025-05-07T20:26:07.6919601Z 2025-05-07T20:26:07.6919608Z 2025-05-07T20:26:07.6919614Z 2025-05-07T20:26:07.6919620Z 2025-05-07T20:26:07.6919626Z 2025-05-07T20:26:07.6919639Z 2025-05-07T20:26:07.6919656Z 2025-05-07T20:26:07.6920023Z  2025-05-07T20:26:07.6920451Z 2025-05-07T20:26:07.6920457Z 2025-05-07T20:26:07.6920463Z 2025-05-07T20:26:07.6920468Z 2025-05-07T20:26:07.6920473Z 2025-05-07T20:26:07.6920486Z 2025-05-07T20:26:07.6920492Z 2025-05-07T20:26:07.6920498Z 2025-05-07T20:26:07.6920504Z 2025-05-07T20:26:07.6920509Z 2025-05-07T20:26:07.6920515Z 2025-05-07T20:26:07.6920521Z 2025-05-07T20:26:07.6921158Z  2025-05-07T20:26:07.6921563Z 2025-05-07T20:26:07.6921570Z 2025-05-07T20:26:07.6921577Z 2025-05-07T20:26:07.6921582Z 2025-05-07T20:26:07.6921588Z 2025-05-07T20:26:07.6921594Z 
2025-05-07T20:26:07.6921768Z 2025-05-07T20:26:07.6921775Z 2025-05-07T20:26:07.6921782Z 2025-05-07T20:26:07.6921800Z 2025-05-07T20:26:07.6921806Z 2025-05-07T20:26:07.6921813Z 2025-05-07T20:26:07.6921915Z 2025-05-07T20:26:07.6922300Z  2025-05-07T20:26:07.6922717Z 2025-05-07T20:26:07.6922724Z 2025-05-07T20:26:07.6922731Z 2025-05-07T20:26:07.6922737Z 2025-05-07T20:26:07.6922743Z 2025-05-07T20:26:07.6922750Z 2025-05-07T20:26:07.6922756Z 2025-05-07T20:26:07.6922762Z 2025-05-07T20:26:07.6922769Z 2025-05-07T20:26:07.6922775Z 2025-05-07T20:26:07.6922781Z 2025-05-07T20:26:07.6922788Z 2025-05-07T20:26:07.6922794Z 2025-05-07T20:26:07.6922801Z 2025-05-07T20:26:07.6923187Z  2025-05-07T20:26:07.6923602Z 2025-05-07T20:26:07.6923609Z 2025-05-07T20:26:07.6923616Z 2025-05-07T20:26:07.6923622Z 2025-05-07T20:26:07.6923628Z 2025-05-07T20:26:07.6923645Z 2025-05-07T20:26:07.6923652Z 2025-05-07T20:26:07.6923658Z 2025-05-07T20:26:07.6923664Z 2025-05-07T20:26:07.6923671Z 2025-05-07T20:26:07.6923677Z 2025-05-07T20:26:07.6923692Z 2025-05-07T20:26:07.6923698Z 2025-05-07T20:26:07.6923705Z 2025-05-07T20:26:07.6923711Z 2025-05-07T20:26:07.6924357Z  2025-05-07T20:26:07.6924766Z 2025-05-07T20:26:07.6924773Z 2025-05-07T20:26:07.6924779Z 2025-05-07T20:26:07.6924785Z 2025-05-07T20:26:07.6924801Z 2025-05-07T20:26:07.6924807Z 2025-05-07T20:26:07.6924814Z 2025-05-07T20:26:07.6924820Z 2025-05-07T20:26:07.6924826Z 2025-05-07T20:26:07.6924833Z 2025-05-07T20:26:07.6924839Z 2025-05-07T20:26:07.6924845Z 2025-05-07T20:26:07.6924852Z 2025-05-07T20:26:07.6924858Z 2025-05-07T20:26:07.6924864Z 2025-05-07T20:26:07.6924871Z 2025-05-07T20:26:07.6925305Z  2025-05-07T20:26:07.6925752Z 2025-05-07T20:26:07.6925761Z 2025-05-07T20:26:07.6925769Z 2025-05-07T20:26:07.6925794Z 2025-05-07T20:26:07.6925800Z 2025-05-07T20:26:07.6925806Z 2025-05-07T20:26:07.6925821Z 2025-05-07T20:26:07.6925827Z 2025-05-07T20:26:07.6925834Z 2025-05-07T20:26:07.6925840Z 2025-05-07T20:26:07.6925846Z 2025-05-07T20:26:07.6925854Z 2025-05-07T20:26:07.6925862Z 2025-05-07T20:26:07.6925881Z 2025-05-07T20:26:07.6925889Z 2025-05-07T20:26:07.6925897Z 2025-05-07T20:26:07.6925905Z 2025-05-07T20:26:07.6926316Z  2025-05-07T20:26:07.6926739Z 2025-05-07T20:26:07.6926746Z 2025-05-07T20:26:07.6926752Z 2025-05-07T20:26:07.6926767Z 2025-05-07T20:26:07.6926773Z 2025-05-07T20:26:07.6926779Z 2025-05-07T20:26:07.6926786Z 2025-05-07T20:26:07.6926792Z 2025-05-07T20:26:07.6926799Z 2025-05-07T20:26:07.6926805Z 2025-05-07T20:26:07.6926811Z 2025-05-07T20:26:07.6926817Z 2025-05-07T20:26:07.6926832Z 2025-05-07T20:26:07.6926838Z 2025-05-07T20:26:07.6926845Z 2025-05-07T20:26:07.6926851Z 2025-05-07T20:26:07.6926857Z 2025-05-07T20:26:07.6926863Z 2025-05-07T20:26:07.6927994Z  2025-05-07T20:26:07.6928410Z 2025-05-07T20:26:07.6928422Z 2025-05-07T20:26:07.6928625Z  2025-05-07T20:26:07.6928829Z 2025-05-07T20:26:07.6928839Z 2025-05-07T20:26:07.6929617Z  2025-05-07T20:26:07.6929834Z 2025-05-07T20:26:07.6929840Z 2025-05-07T20:26:07.6929851Z 2025-05-07T20:26:07.6930367Z  2025-05-07T20:26:07.6930577Z 2025-05-07T20:26:07.6930584Z 2025-05-07T20:26:07.6930599Z 2025-05-07T20:26:07.6930610Z 2025-05-07T20:26:07.6931396Z  2025-05-07T20:26:07.6931616Z 2025-05-07T20:26:07.6931623Z 2025-05-07T20:26:07.6931629Z 2025-05-07T20:26:07.6931635Z 2025-05-07T20:26:07.6931654Z 2025-05-07T20:26:07.6932195Z  2025-05-07T20:26:07.6932413Z 2025-05-07T20:26:07.6932672Z 2025-05-07T20:26:07.6932678Z 2025-05-07T20:26:07.6932684Z 2025-05-07T20:26:07.6932694Z 2025-05-07T20:26:07.6932700Z 2025-05-07T20:26:07.6933030Z  2025-05-07T20:26:07.6934121Z 
2025-05-07T20:26:07.6934134Z 2025-05-07T20:26:07.6934140Z 2025-05-07T20:26:07.6934147Z 2025-05-07T20:26:07.6934153Z 2025-05-07T20:26:07.6934171Z 2025-05-07T20:26:07.6934188Z 2025-05-07T20:26:07.6934422Z  2025-05-07T20:26:07.6934686Z 2025-05-07T20:26:07.6934692Z 2025-05-07T20:26:07.6934699Z 2025-05-07T20:26:07.6934705Z 2025-05-07T20:26:07.6934720Z 2025-05-07T20:26:07.6934726Z 2025-05-07T20:26:07.6934733Z 2025-05-07T20:26:07.6934739Z 2025-05-07T20:26:07.6934973Z  2025-05-07T20:26:07.6935242Z 2025-05-07T20:26:07.6935248Z 2025-05-07T20:26:07.6935262Z 2025-05-07T20:26:07.6935268Z 2025-05-07T20:26:07.6935274Z 2025-05-07T20:26:07.6935279Z 2025-05-07T20:26:07.6935285Z 2025-05-07T20:26:07.6935291Z 2025-05-07T20:26:07.6935296Z 2025-05-07T20:26:07.6935659Z  2025-05-07T20:26:07.6935968Z 2025-05-07T20:26:07.6935974Z 2025-05-07T20:26:07.6935980Z 2025-05-07T20:26:07.6935986Z 2025-05-07T20:26:07.6935993Z 2025-05-07T20:26:07.6936012Z 2025-05-07T20:26:07.6936019Z 2025-05-07T20:26:07.6936025Z 2025-05-07T20:26:07.6936031Z 2025-05-07T20:26:07.6936041Z 2025-05-07T20:26:07.6936654Z  2025-05-07T20:26:07.6936866Z 2025-05-07T20:26:07.6936871Z 2025-05-07T20:26:07.6936875Z 2025-05-07T20:26:07.6936880Z 2025-05-07T20:26:07.6936891Z 2025-05-07T20:26:07.6936902Z 2025-05-07T20:26:07.6936906Z 2025-05-07T20:26:07.6936910Z 2025-05-07T20:26:07.6936914Z 2025-05-07T20:26:07.6936918Z 2025-05-07T20:26:07.6936921Z 2025-05-07T20:26:07.6937354Z  2025-05-07T20:26:07.6937599Z 2025-05-07T20:26:07.6937604Z 2025-05-07T20:26:07.6937608Z 2025-05-07T20:26:07.6937612Z 2025-05-07T20:26:07.6937616Z 2025-05-07T20:26:07.6937620Z 2025-05-07T20:26:07.6937624Z 2025-05-07T20:26:07.6937641Z 2025-05-07T20:26:07.6937653Z 2025-05-07T20:26:07.6937657Z 2025-05-07T20:26:07.6937660Z 2025-05-07T20:26:07.6937664Z 2025-05-07T20:26:07.6938093Z  2025-05-07T20:26:07.6938320Z 2025-05-07T20:26:07.6938324Z 2025-05-07T20:26:07.6938328Z 2025-05-07T20:26:07.6938338Z 2025-05-07T20:26:07.6938343Z 2025-05-07T20:26:07.6938347Z 2025-05-07T20:26:07.6938351Z 2025-05-07T20:26:07.6938355Z 2025-05-07T20:26:07.6938359Z 2025-05-07T20:26:07.6938363Z 2025-05-07T20:26:07.6938367Z 2025-05-07T20:26:07.6938371Z 2025-05-07T20:26:07.6938375Z 2025-05-07T20:26:07.6938834Z  2025-05-07T20:26:07.6939092Z 2025-05-07T20:26:07.6939103Z 2025-05-07T20:26:07.6939115Z 2025-05-07T20:26:07.6939119Z 2025-05-07T20:26:07.6939123Z 2025-05-07T20:26:07.6939127Z 2025-05-07T20:26:07.6939131Z 2025-05-07T20:26:07.6939134Z 2025-05-07T20:26:07.6939139Z 2025-05-07T20:26:07.6939142Z 2025-05-07T20:26:07.6939146Z 2025-05-07T20:26:07.6939150Z 2025-05-07T20:26:07.6939162Z 2025-05-07T20:26:07.6939166Z 2025-05-07T20:26:07.6939582Z  2025-05-07T20:26:07.6939831Z 2025-05-07T20:26:07.6939835Z 2025-05-07T20:26:07.6939846Z 2025-05-07T20:26:07.6939856Z 2025-05-07T20:26:07.6939860Z 2025-05-07T20:26:07.6939864Z 2025-05-07T20:26:07.6939868Z 2025-05-07T20:26:07.6939872Z 2025-05-07T20:26:07.6939876Z 2025-05-07T20:26:07.6939880Z 2025-05-07T20:26:07.6939884Z 2025-05-07T20:26:07.6939889Z 2025-05-07T20:26:07.6939892Z 2025-05-07T20:26:07.6939897Z 2025-05-07T20:26:07.6939910Z 2025-05-07T20:26:07.6940430Z  2025-05-07T20:26:07.6940701Z 2025-05-07T20:26:07.6940705Z 2025-05-07T20:26:07.6940710Z 2025-05-07T20:26:07.6940721Z 2025-05-07T20:26:07.6940725Z 2025-05-07T20:26:07.6940729Z 2025-05-07T20:26:07.6940733Z 2025-05-07T20:26:07.6940737Z 2025-05-07T20:26:07.6940741Z 2025-05-07T20:26:07.6940745Z 2025-05-07T20:26:07.6940749Z 2025-05-07T20:26:07.6940753Z 2025-05-07T20:26:07.6941113Z 2025-05-07T20:26:07.6941119Z 2025-05-07T20:26:07.6941125Z 
2025-05-07T20:26:07.6941130Z 2025-05-07T20:26:07.6941398Z  2025-05-07T20:26:07.6941818Z 2025-05-07T20:26:07.6941837Z 2025-05-07T20:26:07.6941844Z 2025-05-07T20:26:07.6941849Z 2025-05-07T20:26:07.6941855Z 2025-05-07T20:26:07.6941860Z 2025-05-07T20:26:07.6941866Z 2025-05-07T20:26:07.6941872Z 2025-05-07T20:26:07.6941877Z 2025-05-07T20:26:07.6941883Z 2025-05-07T20:26:07.6941889Z 2025-05-07T20:26:07.6941894Z 2025-05-07T20:26:07.6941900Z 2025-05-07T20:26:07.6941905Z 2025-05-07T20:26:07.6941912Z 2025-05-07T20:26:07.6941917Z 2025-05-07T20:26:07.6941923Z 2025-05-07T20:26:07.6942222Z  2025-05-07T20:26:07.6942558Z 2025-05-07T20:26:07.6942564Z 2025-05-07T20:26:07.6942570Z 2025-05-07T20:26:07.6942576Z 2025-05-07T20:26:07.6942581Z 2025-05-07T20:26:07.6942587Z 2025-05-07T20:26:07.6942592Z 2025-05-07T20:26:07.6942598Z 2025-05-07T20:26:07.6942614Z 2025-05-07T20:26:07.6942619Z 2025-05-07T20:26:07.6942625Z 2025-05-07T20:26:07.6942630Z 2025-05-07T20:26:07.6942646Z 2025-05-07T20:26:07.6942652Z 2025-05-07T20:26:07.6942665Z 2025-05-07T20:26:07.6942670Z 2025-05-07T20:26:07.6942676Z 2025-05-07T20:26:07.6942681Z 2025-05-07T20:26:07.6943351Z  2025-05-07T20:26:07.6943694Z 2025-05-07T20:26:07.6943700Z 2025-05-07T20:26:07.6943861Z  2025-05-07T20:26:07.6944029Z 2025-05-07T20:26:07.6944035Z 2025-05-07T20:26:07.6944427Z  2025-05-07T20:26:07.6944597Z 2025-05-07T20:26:07.6944603Z 2025-05-07T20:26:07.6944608Z 2025-05-07T20:26:07.6945226Z  2025-05-07T20:26:07.6945414Z 2025-05-07T20:26:07.6945420Z 2025-05-07T20:26:07.6945426Z 2025-05-07T20:26:07.6945431Z 2025-05-07T20:26:07.6945678Z  2025-05-07T20:26:07.6945894Z 2025-05-07T20:26:07.6945901Z 2025-05-07T20:26:07.6945913Z 2025-05-07T20:26:07.6945920Z 2025-05-07T20:26:07.6945940Z 2025-05-07T20:26:07.6946438Z  2025-05-07T20:26:07.6946651Z 2025-05-07T20:26:07.6946657Z 2025-05-07T20:26:07.6946663Z 2025-05-07T20:26:07.6946673Z 2025-05-07T20:26:07.6946679Z 2025-05-07T20:26:07.6946690Z 2025-05-07T20:26:07.6947103Z  2025-05-07T20:26:07.6947304Z 2025-05-07T20:26:07.6947316Z 2025-05-07T20:26:07.6947322Z 2025-05-07T20:26:07.6947327Z 2025-05-07T20:26:07.6947333Z 2025-05-07T20:26:07.6947338Z 2025-05-07T20:26:07.6947352Z 2025-05-07T20:26:07.6947841Z  2025-05-07T20:26:07.6948068Z 2025-05-07T20:26:07.6948074Z 2025-05-07T20:26:07.6948080Z 2025-05-07T20:26:07.6948086Z 2025-05-07T20:26:07.6948091Z 2025-05-07T20:26:07.6948097Z 2025-05-07T20:26:07.6948102Z 2025-05-07T20:26:07.6948111Z 2025-05-07T20:26:07.6948506Z  2025-05-07T20:26:07.6948734Z 2025-05-07T20:26:07.6948746Z 2025-05-07T20:26:07.6948750Z 2025-05-07T20:26:07.6948754Z 2025-05-07T20:26:07.6948758Z 2025-05-07T20:26:07.6948769Z 2025-05-07T20:26:07.6948773Z 2025-05-07T20:26:07.6948786Z 2025-05-07T20:26:07.6948790Z 2025-05-07T20:26:07.6949331Z  2025-05-07T20:26:07.6949636Z 2025-05-07T20:26:07.6949643Z 2025-05-07T20:26:07.6949659Z 2025-05-07T20:26:07.6949666Z 2025-05-07T20:26:07.6949672Z 2025-05-07T20:26:07.6949677Z 2025-05-07T20:26:07.6949683Z 2025-05-07T20:26:07.6949689Z 2025-05-07T20:26:07.6949695Z 2025-05-07T20:26:07.6949706Z 2025-05-07T20:26:07.6950000Z  2025-05-07T20:26:07.6950259Z 2025-05-07T20:26:07.6950265Z 2025-05-07T20:26:07.6950270Z 2025-05-07T20:26:07.6950283Z 2025-05-07T20:26:07.6950289Z 2025-05-07T20:26:07.6950294Z 2025-05-07T20:26:07.6950300Z 2025-05-07T20:26:07.6950306Z 2025-05-07T20:26:07.6950311Z 2025-05-07T20:26:07.6950317Z 2025-05-07T20:26:07.6950322Z 2025-05-07T20:26:07.6950780Z  2025-05-07T20:26:07.6951108Z 2025-05-07T20:26:07.6951114Z 2025-05-07T20:26:07.6951120Z 2025-05-07T20:26:07.6951126Z 2025-05-07T20:26:07.6951132Z 
2025-05-07T20:26:07.6951307Z 2025-05-07T20:26:07.6951313Z 2025-05-07T20:26:07.6951320Z 2025-05-07T20:26:07.6951326Z 2025-05-07T20:26:07.6951333Z 2025-05-07T20:26:07.6951339Z 2025-05-07T20:26:07.6951456Z 2025-05-07T20:26:07.6951722Z  2025-05-07T20:26:07.6952045Z 2025-05-07T20:26:07.6952052Z 2025-05-07T20:26:07.6952058Z 2025-05-07T20:26:07.6952064Z 2025-05-07T20:26:07.6952070Z 2025-05-07T20:26:07.6952076Z 2025-05-07T20:26:07.6952093Z 2025-05-07T20:26:07.6952099Z 2025-05-07T20:26:07.6952105Z 2025-05-07T20:26:07.6952112Z 2025-05-07T20:26:07.6952118Z 2025-05-07T20:26:07.6952135Z 2025-05-07T20:26:07.6952142Z 2025-05-07T20:26:07.6952414Z  2025-05-07T20:26:07.6952762Z 2025-05-07T20:26:07.6952769Z 2025-05-07T20:26:07.6952774Z 2025-05-07T20:26:07.6952780Z 2025-05-07T20:26:07.6952786Z 2025-05-07T20:26:07.6952792Z 2025-05-07T20:26:07.6952799Z 2025-05-07T20:26:07.6952805Z 2025-05-07T20:26:07.6952811Z 2025-05-07T20:26:07.6952818Z 2025-05-07T20:26:07.6952837Z 2025-05-07T20:26:07.6952843Z 2025-05-07T20:26:07.6952849Z 2025-05-07T20:26:07.6952855Z 2025-05-07T20:26:07.6953149Z  2025-05-07T20:26:07.6953637Z 2025-05-07T20:26:07.6953644Z 2025-05-07T20:26:07.6953651Z 2025-05-07T20:26:07.6953657Z 2025-05-07T20:26:07.6953663Z 2025-05-07T20:26:07.6953670Z 2025-05-07T20:26:07.6953688Z 2025-05-07T20:26:07.6953694Z 2025-05-07T20:26:07.6953700Z 2025-05-07T20:26:07.6953706Z 2025-05-07T20:26:07.6953712Z 2025-05-07T20:26:07.6953718Z 2025-05-07T20:26:07.6953735Z 2025-05-07T20:26:07.6953742Z 2025-05-07T20:26:07.6953748Z 2025-05-07T20:26:07.6954042Z  2025-05-07T20:26:07.6954394Z 2025-05-07T20:26:07.6954401Z 2025-05-07T20:26:07.6954407Z 2025-05-07T20:26:07.6954412Z 2025-05-07T20:26:07.6954428Z 2025-05-07T20:26:07.6954434Z 2025-05-07T20:26:07.6954441Z 2025-05-07T20:26:07.6954447Z 2025-05-07T20:26:07.6954453Z 2025-05-07T20:26:07.6954471Z 2025-05-07T20:26:07.6954486Z 2025-05-07T20:26:07.6954492Z 2025-05-07T20:26:07.6954499Z 2025-05-07T20:26:07.6954505Z 2025-05-07T20:26:07.6954511Z 2025-05-07T20:26:07.6954518Z 2025-05-07T20:26:07.6954821Z  2025-05-07T20:26:07.6955206Z 2025-05-07T20:26:07.6955212Z 2025-05-07T20:26:07.6955228Z 2025-05-07T20:26:07.6955235Z 2025-05-07T20:26:07.6955241Z 2025-05-07T20:26:07.6955248Z 2025-05-07T20:26:07.6955254Z 2025-05-07T20:26:07.6955260Z 2025-05-07T20:26:07.6955266Z 2025-05-07T20:26:07.6955273Z 2025-05-07T20:26:07.6955280Z 2025-05-07T20:26:07.6955286Z 2025-05-07T20:26:07.6955293Z 2025-05-07T20:26:07.6955299Z 2025-05-07T20:26:07.6955306Z 2025-05-07T20:26:07.6955312Z 2025-05-07T20:26:07.6955318Z 2025-05-07T20:26:07.6955631Z  2025-05-07T20:26:07.6956013Z 2025-05-07T20:26:07.6956020Z 2025-05-07T20:26:07.6956027Z 2025-05-07T20:26:07.6956033Z 2025-05-07T20:26:07.6956040Z 2025-05-07T20:26:07.6956055Z 2025-05-07T20:26:07.6956070Z 2025-05-07T20:26:07.6956076Z 2025-05-07T20:26:07.6956083Z 2025-05-07T20:26:07.6956089Z 2025-05-07T20:26:07.6956096Z 2025-05-07T20:26:07.6956102Z 2025-05-07T20:26:07.6956115Z 2025-05-07T20:26:07.6956122Z 2025-05-07T20:26:07.6956136Z 2025-05-07T20:26:07.6956143Z 2025-05-07T20:26:07.6956149Z 2025-05-07T20:26:07.6956155Z 2025-05-07T20:26:07.6957090Z  2025-05-07T20:26:07.6957480Z 2025-05-07T20:26:07.6957487Z 2025-05-07T20:26:07.6957690Z  2025-05-07T20:26:07.6957877Z 2025-05-07T20:26:07.6957883Z 2025-05-07T20:26:07.6958253Z  2025-05-07T20:26:07.6958455Z 2025-05-07T20:26:07.6958462Z 2025-05-07T20:26:07.6958471Z 2025-05-07T20:26:07.6958980Z  2025-05-07T20:26:07.6959170Z 2025-05-07T20:26:07.6959182Z 2025-05-07T20:26:07.6959188Z 2025-05-07T20:26:07.6959194Z 2025-05-07T20:26:07.6959666Z  
2025-05-07T20:26:07.6959890Z 2025-05-07T20:26:07.6959897Z 2025-05-07T20:26:07.6959903Z 2025-05-07T20:26:07.6960056Z 2025-05-07T20:26:07.6960066Z 2025-05-07T20:26:07.6960336Z  2025-05-07T20:26:07.6960577Z 2025-05-07T20:26:07.6960583Z 2025-05-07T20:26:07.6960589Z 2025-05-07T20:26:07.6960717Z 2025-05-07T20:26:07.6960724Z 2025-05-07T20:26:07.6960736Z 2025-05-07T20:26:07.6961144Z  2025-05-07T20:26:07.6961354Z 2025-05-07T20:26:07.6961358Z 2025-05-07T20:26:07.6961362Z 2025-05-07T20:26:07.6961366Z 2025-05-07T20:26:07.6961370Z 2025-05-07T20:26:07.6961378Z 2025-05-07T20:26:07.6961382Z 2025-05-07T20:26:07.6961912Z  2025-05-07T20:26:07.6962099Z 2025-05-07T20:26:07.6962105Z 2025-05-07T20:26:07.6962109Z 2025-05-07T20:26:07.6962122Z 2025-05-07T20:26:07.6962126Z 2025-05-07T20:26:07.6962130Z 2025-05-07T20:26:07.6962134Z 2025-05-07T20:26:07.6962138Z 2025-05-07T20:26:07.6962544Z  2025-05-07T20:26:07.6962821Z 2025-05-07T20:26:07.6962827Z 2025-05-07T20:26:07.6962833Z 2025-05-07T20:26:07.6962839Z 2025-05-07T20:26:07.6962845Z 2025-05-07T20:26:07.6962869Z 2025-05-07T20:26:07.6962875Z 2025-05-07T20:26:07.6962881Z 2025-05-07T20:26:07.6962888Z 2025-05-07T20:26:07.6963179Z  2025-05-07T20:26:07.6963481Z 2025-05-07T20:26:07.6963487Z 2025-05-07T20:26:07.6963493Z 2025-05-07T20:26:07.6963506Z 2025-05-07T20:26:07.6963512Z 2025-05-07T20:26:07.6963517Z 2025-05-07T20:26:07.6963523Z 2025-05-07T20:26:07.6963529Z 2025-05-07T20:26:07.6963534Z 2025-05-07T20:26:07.6963540Z 2025-05-07T20:26:07.6963925Z  2025-05-07T20:26:07.6964186Z 2025-05-07T20:26:07.6964191Z 2025-05-07T20:26:07.6964197Z 2025-05-07T20:26:07.6964203Z 2025-05-07T20:26:07.6964208Z 2025-05-07T20:26:07.6964213Z 2025-05-07T20:26:07.6964228Z 2025-05-07T20:26:07.6964234Z 2025-05-07T20:26:07.6964240Z 2025-05-07T20:26:07.6964245Z 2025-05-07T20:26:07.6964255Z 2025-05-07T20:26:07.6964492Z  2025-05-07T20:26:07.6964764Z 2025-05-07T20:26:07.6964777Z 2025-05-07T20:26:07.6964783Z 2025-05-07T20:26:07.6964806Z 2025-05-07T20:26:07.6964812Z 2025-05-07T20:26:07.6964817Z 2025-05-07T20:26:07.6964823Z 2025-05-07T20:26:07.6964828Z 2025-05-07T20:26:07.6964834Z 2025-05-07T20:26:07.6964845Z 2025-05-07T20:26:07.6964851Z 2025-05-07T20:26:07.6964856Z 2025-05-07T20:26:07.6965179Z  2025-05-07T20:26:07.6965476Z 2025-05-07T20:26:07.6965483Z 2025-05-07T20:26:07.6965488Z 2025-05-07T20:26:07.6965493Z 2025-05-07T20:26:07.6965507Z 2025-05-07T20:26:07.6965512Z 2025-05-07T20:26:07.6965518Z 2025-05-07T20:26:07.6965523Z 2025-05-07T20:26:07.6965529Z 2025-05-07T20:26:07.6965534Z 2025-05-07T20:26:07.6965540Z 2025-05-07T20:26:07.6965553Z 2025-05-07T20:26:07.6965558Z 2025-05-07T20:26:07.6965772Z  2025-05-07T20:26:07.6966066Z 2025-05-07T20:26:07.6966072Z 2025-05-07T20:26:07.6966077Z 2025-05-07T20:26:07.6966083Z 2025-05-07T20:26:07.6966088Z 2025-05-07T20:26:07.6966094Z 2025-05-07T20:26:07.6966099Z 2025-05-07T20:26:07.6966105Z 2025-05-07T20:26:07.6966129Z 2025-05-07T20:26:07.6966135Z 2025-05-07T20:26:07.6966141Z 2025-05-07T20:26:07.6966146Z 2025-05-07T20:26:07.6966151Z 2025-05-07T20:26:07.6966156Z 2025-05-07T20:26:07.6966382Z  2025-05-07T20:26:07.6966682Z 2025-05-07T20:26:07.6966688Z 2025-05-07T20:26:07.6966693Z 2025-05-07T20:26:07.6966699Z 2025-05-07T20:26:07.6966705Z 2025-05-07T20:26:07.6966710Z 2025-05-07T20:26:07.6966716Z 2025-05-07T20:26:07.6966722Z 2025-05-07T20:26:07.6966728Z 2025-05-07T20:26:07.6966733Z 2025-05-07T20:26:07.6966738Z 2025-05-07T20:26:07.6966743Z 2025-05-07T20:26:07.6966763Z 2025-05-07T20:26:07.6966768Z 2025-05-07T20:26:07.6966774Z 2025-05-07T20:26:07.6966999Z  2025-05-07T20:26:07.6967302Z 
2025-05-07T20:26:07.6967308Z 2025-05-07T20:26:07.6967321Z 2025-05-07T20:26:07.6967326Z 2025-05-07T20:26:07.6967331Z 2025-05-07T20:26:07.6967336Z 2025-05-07T20:26:07.6967341Z 2025-05-07T20:26:07.6967347Z 2025-05-07T20:26:07.6967503Z 2025-05-07T20:26:07.6967509Z 2025-05-07T20:26:07.6967515Z 2025-05-07T20:26:07.6967520Z 2025-05-07T20:26:07.6967526Z 2025-05-07T20:26:07.6967532Z 2025-05-07T20:26:07.6967628Z 2025-05-07T20:26:07.6967634Z 2025-05-07T20:26:07.6967899Z  2025-05-07T20:26:07.6968214Z 2025-05-07T20:26:07.6968220Z 2025-05-07T20:26:07.6968225Z 2025-05-07T20:26:07.6968231Z 2025-05-07T20:26:07.6968237Z 2025-05-07T20:26:07.6968242Z 2025-05-07T20:26:07.6968248Z 2025-05-07T20:26:07.6968253Z 2025-05-07T20:26:07.6968259Z 2025-05-07T20:26:07.6968264Z 2025-05-07T20:26:07.6968270Z 2025-05-07T20:26:07.6968275Z 2025-05-07T20:26:07.6968281Z 2025-05-07T20:26:07.6968286Z 2025-05-07T20:26:07.6968292Z 2025-05-07T20:26:07.6968305Z 2025-05-07T20:26:07.6968310Z 2025-05-07T20:26:07.6968560Z  2025-05-07T20:26:07.6968884Z 2025-05-07T20:26:07.6968887Z 2025-05-07T20:26:07.6968898Z 2025-05-07T20:26:07.6968902Z 2025-05-07T20:26:07.6968913Z 2025-05-07T20:26:07.6968918Z 2025-05-07T20:26:07.6968921Z 2025-05-07T20:26:07.6968926Z 2025-05-07T20:26:07.6968929Z 2025-05-07T20:26:07.6968934Z 2025-05-07T20:26:07.6968943Z 2025-05-07T20:26:07.6968947Z 2025-05-07T20:26:07.6968951Z 2025-05-07T20:26:07.6968955Z 2025-05-07T20:26:07.6968959Z 2025-05-07T20:26:07.6968963Z 2025-05-07T20:26:07.6968967Z 2025-05-07T20:26:07.6968971Z 2025-05-07T20:26:07.6969617Z  2025-05-07T20:26:07.6970014Z 2025-05-07T20:26:07.6970020Z 2025-05-07T20:26:07.6970217Z  2025-05-07T20:26:07.6970420Z 2025-05-07T20:26:07.6970427Z 2025-05-07T20:26:07.6970606Z  2025-05-07T20:26:07.6970802Z 2025-05-07T20:26:07.6970808Z 2025-05-07T20:26:07.6970821Z 2025-05-07T20:26:07.6971024Z  2025-05-07T20:26:07.6971238Z 2025-05-07T20:26:07.6971244Z 2025-05-07T20:26:07.6971251Z 2025-05-07T20:26:07.6971261Z 2025-05-07T20:26:07.6971690Z  2025-05-07T20:26:07.6971851Z 2025-05-07T20:26:07.6971864Z 2025-05-07T20:26:07.6971868Z 2025-05-07T20:26:07.6971872Z 2025-05-07T20:26:07.6971879Z 2025-05-07T20:26:07.6972160Z  2025-05-07T20:26:07.6972377Z 2025-05-07T20:26:07.6972390Z 2025-05-07T20:26:07.6972401Z 2025-05-07T20:26:07.6972407Z 2025-05-07T20:26:07.6972413Z 2025-05-07T20:26:07.6972419Z 2025-05-07T20:26:07.6972821Z  2025-05-07T20:26:07.6973033Z 2025-05-07T20:26:07.6973039Z 2025-05-07T20:26:07.6973050Z 2025-05-07T20:26:07.6973055Z 2025-05-07T20:26:07.6973061Z 2025-05-07T20:26:07.6973066Z 2025-05-07T20:26:07.6973072Z 2025-05-07T20:26:07.6973480Z  2025-05-07T20:26:07.6973718Z 2025-05-07T20:26:07.6973724Z 2025-05-07T20:26:07.6973730Z 2025-05-07T20:26:07.6973746Z 2025-05-07T20:26:07.6973752Z 2025-05-07T20:26:07.6973758Z 2025-05-07T20:26:07.6973764Z 2025-05-07T20:26:07.6973769Z 2025-05-07T20:26:07.6973971Z  2025-05-07T20:26:07.6974217Z 2025-05-07T20:26:07.6974223Z 2025-05-07T20:26:07.6974239Z 2025-05-07T20:26:07.6974244Z 2025-05-07T20:26:07.6974250Z 2025-05-07T20:26:07.6974255Z 2025-05-07T20:26:07.6974261Z 2025-05-07T20:26:07.6974270Z 2025-05-07T20:26:07.6974277Z 2025-05-07T20:26:07.6974491Z  2025-05-07T20:26:07.6974743Z 2025-05-07T20:26:07.6974748Z 2025-05-07T20:26:07.6974754Z 2025-05-07T20:26:07.6974760Z 2025-05-07T20:26:07.6974765Z 2025-05-07T20:26:07.6974771Z 2025-05-07T20:26:07.6974776Z 2025-05-07T20:26:07.6974781Z 2025-05-07T20:26:07.6974787Z 2025-05-07T20:26:07.6974793Z 2025-05-07T20:26:07.6975018Z  2025-05-07T20:26:07.6975273Z 2025-05-07T20:26:07.6975278Z 2025-05-07T20:26:07.6975284Z 
2025-05-07T20:26:07.6975290Z 2025-05-07T20:26:07.6975295Z 2025-05-07T20:26:07.6975305Z 2025-05-07T20:26:07.6975311Z 2025-05-07T20:26:07.6975326Z 2025-05-07T20:26:07.6975332Z 2025-05-07T20:26:07.6975337Z 2025-05-07T20:26:07.6975343Z 2025-05-07T20:26:07.6975559Z  2025-05-07T20:26:07.6975836Z 2025-05-07T20:26:07.6975998Z 2025-05-07T20:26:07.6976004Z 2025-05-07T20:26:07.6976011Z 2025-05-07T20:26:07.6976017Z 2025-05-07T20:26:07.6976023Z 2025-05-07T20:26:07.6976028Z 2025-05-07T20:26:07.6976119Z 2025-05-07T20:26:07.6976125Z 2025-05-07T20:26:07.6976131Z 2025-05-07T20:26:07.6976136Z 2025-05-07T20:26:07.6976142Z 2025-05-07T20:26:07.6976372Z  2025-05-07T20:26:07.6976665Z 2025-05-07T20:26:07.6976671Z 2025-05-07T20:26:07.6976676Z 2025-05-07T20:26:07.6976682Z 2025-05-07T20:26:07.6976687Z 2025-05-07T20:26:07.6976693Z 2025-05-07T20:26:07.6976698Z 2025-05-07T20:26:07.6976703Z 2025-05-07T20:26:07.6976709Z 2025-05-07T20:26:07.6976714Z 2025-05-07T20:26:07.6976720Z 2025-05-07T20:26:07.6976725Z 2025-05-07T20:26:07.6976731Z 2025-05-07T20:26:07.6976955Z  2025-05-07T20:26:07.6977239Z 2025-05-07T20:26:07.6977245Z 2025-05-07T20:26:07.6977250Z 2025-05-07T20:26:07.6977256Z 2025-05-07T20:26:07.6977261Z 2025-05-07T20:26:07.6977277Z 2025-05-07T20:26:07.6977282Z 2025-05-07T20:26:07.6977288Z 2025-05-07T20:26:07.6977293Z 2025-05-07T20:26:07.6977299Z 2025-05-07T20:26:07.6977305Z 2025-05-07T20:26:07.6977326Z 2025-05-07T20:26:07.6977333Z 2025-05-07T20:26:07.6977338Z 2025-05-07T20:26:07.6977565Z  2025-05-07T20:26:07.6977866Z 2025-05-07T20:26:07.6977872Z 2025-05-07T20:26:07.6977877Z 2025-05-07T20:26:07.6977893Z 2025-05-07T20:26:07.6977899Z 2025-05-07T20:26:07.6977904Z 2025-05-07T20:26:07.6977910Z 2025-05-07T20:26:07.6977915Z 2025-05-07T20:26:07.6977921Z 2025-05-07T20:26:07.6977926Z 2025-05-07T20:26:07.6977932Z 2025-05-07T20:26:07.6977937Z 2025-05-07T20:26:07.6977943Z 2025-05-07T20:26:07.6977948Z 2025-05-07T20:26:07.6977954Z 2025-05-07T20:26:07.6978200Z  2025-05-07T20:26:07.6978512Z 2025-05-07T20:26:07.6978518Z 2025-05-07T20:26:07.6978523Z 2025-05-07T20:26:07.6978529Z 2025-05-07T20:26:07.6978534Z 2025-05-07T20:26:07.6978547Z 2025-05-07T20:26:07.6978553Z 2025-05-07T20:26:07.6978558Z 2025-05-07T20:26:07.6978564Z 2025-05-07T20:26:07.6978569Z 2025-05-07T20:26:07.6978575Z 2025-05-07T20:26:07.6978580Z 2025-05-07T20:26:07.6978591Z 2025-05-07T20:26:07.6978597Z 2025-05-07T20:26:07.6978603Z 2025-05-07T20:26:07.6978609Z 2025-05-07T20:26:07.6978847Z  2025-05-07T20:26:07.6979158Z 2025-05-07T20:26:07.6979175Z 2025-05-07T20:26:07.6979181Z 2025-05-07T20:26:07.6979186Z 2025-05-07T20:26:07.6979199Z 2025-05-07T20:26:07.6979205Z 2025-05-07T20:26:07.6979210Z 2025-05-07T20:26:07.6979216Z 2025-05-07T20:26:07.6979221Z 2025-05-07T20:26:07.6979227Z 2025-05-07T20:26:07.6979232Z 2025-05-07T20:26:07.6979237Z 2025-05-07T20:26:07.6979243Z 2025-05-07T20:26:07.6979248Z 2025-05-07T20:26:07.6979254Z 2025-05-07T20:26:07.6979259Z 2025-05-07T20:26:07.6979265Z 2025-05-07T20:26:07.6979501Z  2025-05-07T20:26:07.6979829Z 2025-05-07T20:26:07.6979844Z 2025-05-07T20:26:07.6979849Z 2025-05-07T20:26:07.6979855Z 2025-05-07T20:26:07.6979860Z 2025-05-07T20:26:07.6979866Z 2025-05-07T20:26:07.6979871Z 2025-05-07T20:26:07.6979883Z 2025-05-07T20:26:07.6979889Z 2025-05-07T20:26:07.6979895Z 2025-05-07T20:26:07.6979900Z 2025-05-07T20:26:07.6979905Z 2025-05-07T20:26:07.6979910Z 2025-05-07T20:26:07.6979915Z 2025-05-07T20:26:07.6979921Z 2025-05-07T20:26:07.6979926Z 2025-05-07T20:26:07.6979932Z 2025-05-07T20:26:07.6979937Z 2025-05-07T20:26:07.6980200Z  2025-05-07T20:26:07.6980522Z 
2025-05-07T20:26:07.6980527Z 2025-05-07T20:26:07.6980687Z  2025-05-07T20:26:07.6980850Z 2025-05-07T20:26:07.6980856Z 2025-05-07T20:26:07.6981018Z  2025-05-07T20:26:07.6981193Z 2025-05-07T20:26:07.6981198Z 2025-05-07T20:26:07.6981204Z 2025-05-07T20:26:07.6981363Z  2025-05-07T20:26:07.6981540Z 2025-05-07T20:26:07.6981546Z 2025-05-07T20:26:07.6981551Z 2025-05-07T20:26:07.6981557Z 2025-05-07T20:26:07.6981845Z  2025-05-07T20:26:07.6982036Z 2025-05-07T20:26:07.6982042Z 2025-05-07T20:26:07.6982048Z 2025-05-07T20:26:07.6982054Z 2025-05-07T20:26:07.6982059Z 2025-05-07T20:26:07.6982324Z  2025-05-07T20:26:07.6982519Z 2025-05-07T20:26:07.6982525Z 2025-05-07T20:26:07.6982539Z 2025-05-07T20:26:07.6982545Z 2025-05-07T20:26:07.6982550Z 2025-05-07T20:26:07.6982556Z 2025-05-07T20:26:07.6982735Z  2025-05-07T20:26:07.6982933Z 2025-05-07T20:26:07.6982939Z 2025-05-07T20:26:07.6982952Z 2025-05-07T20:26:07.6982958Z 2025-05-07T20:26:07.6982963Z 2025-05-07T20:26:07.6982969Z 2025-05-07T20:26:07.6982974Z 2025-05-07T20:26:07.6983158Z  2025-05-07T20:26:07.6983371Z 2025-05-07T20:26:07.6983388Z 2025-05-07T20:26:07.6983393Z 2025-05-07T20:26:07.6983399Z 2025-05-07T20:26:07.6983404Z 2025-05-07T20:26:07.6983410Z 2025-05-07T20:26:07.6983415Z 2025-05-07T20:26:07.6983421Z 2025-05-07T20:26:07.6983614Z  2025-05-07T20:26:07.6983868Z 2025-05-07T20:26:07.6983874Z 2025-05-07T20:26:07.6983879Z 2025-05-07T20:26:07.6983885Z 2025-05-07T20:26:07.6983890Z 2025-05-07T20:26:07.6983896Z 2025-05-07T20:26:07.6983909Z 2025-05-07T20:26:07.6983914Z 2025-05-07T20:26:07.6983920Z 2025-05-07T20:26:07.6984126Z  2025-05-07T20:26:07.6984389Z 2025-05-07T20:26:07.6984395Z 2025-05-07T20:26:07.6984401Z 2025-05-07T20:26:07.6984407Z 2025-05-07T20:26:07.6984413Z 2025-05-07T20:26:07.6984418Z 2025-05-07T20:26:07.6984423Z 2025-05-07T20:26:07.6984429Z 2025-05-07T20:26:07.6984434Z 2025-05-07T20:26:07.6984440Z 2025-05-07T20:26:07.6984663Z  2025-05-07T20:26:07.6984938Z 2025-05-07T20:26:07.6984944Z 2025-05-07T20:26:07.6984950Z 2025-05-07T20:26:07.6984956Z 2025-05-07T20:26:07.6984961Z 2025-05-07T20:26:07.6984967Z 2025-05-07T20:26:07.6984973Z 2025-05-07T20:26:07.6984979Z 2025-05-07T20:26:07.6984984Z 2025-05-07T20:26:07.6984990Z 2025-05-07T20:26:07.6984996Z 2025-05-07T20:26:07.6985181Z  2025-05-07T20:26:07.6985376Z 2025-05-07T20:26:07.6985380Z 2025-05-07T20:26:07.6985385Z 2025-05-07T20:26:07.6985389Z 2025-05-07T20:26:07.6985398Z 2025-05-07T20:26:07.6985403Z 2025-05-07T20:26:07.6985407Z 2025-05-07T20:26:07.6985411Z 2025-05-07T20:26:07.6985415Z 2025-05-07T20:26:07.6985419Z 2025-05-07T20:26:07.6985430Z 2025-05-07T20:26:07.6985434Z 2025-05-07T20:26:07.6985594Z  2025-05-07T20:26:07.6985868Z 2025-05-07T20:26:07.6985874Z 2025-05-07T20:26:07.6985882Z 2025-05-07T20:26:07.6985889Z 2025-05-07T20:26:07.6985895Z 2025-05-07T20:26:07.6985902Z 2025-05-07T20:26:07.6985930Z 2025-05-07T20:26:07.6985935Z 2025-05-07T20:26:07.6985941Z 2025-05-07T20:26:07.6985946Z 2025-05-07T20:26:07.6985951Z 2025-05-07T20:26:07.6985956Z 2025-05-07T20:26:07.6985962Z 2025-05-07T20:26:07.6986176Z  2025-05-07T20:26:07.6986486Z 2025-05-07T20:26:07.6986492Z 2025-05-07T20:26:07.6986497Z 2025-05-07T20:26:07.6986511Z 2025-05-07T20:26:07.6986517Z 2025-05-07T20:26:07.6986523Z 2025-05-07T20:26:07.6986529Z 2025-05-07T20:26:07.6986534Z 2025-05-07T20:26:07.6986540Z 2025-05-07T20:26:07.6986551Z 2025-05-07T20:26:07.6986557Z 2025-05-07T20:26:07.6986563Z 2025-05-07T20:26:07.6986568Z 2025-05-07T20:26:07.6986574Z 2025-05-07T20:26:07.6986810Z  2025-05-07T20:26:07.6987106Z 2025-05-07T20:26:07.6987111Z 2025-05-07T20:26:07.6987117Z 
2025-05-07T20:26:07.6987122Z 2025-05-07T20:26:07.6987128Z 2025-05-07T20:26:07.6987133Z 2025-05-07T20:26:07.6987139Z 2025-05-07T20:26:07.6987144Z 2025-05-07T20:26:07.6987163Z 2025-05-07T20:26:07.6987169Z 2025-05-07T20:26:07.6987175Z 2025-05-07T20:26:07.6987188Z 2025-05-07T20:26:07.6987193Z 2025-05-07T20:26:07.6987199Z 2025-05-07T20:26:07.6987204Z 2025-05-07T20:26:07.6987431Z  2025-05-07T20:26:07.6987734Z 2025-05-07T20:26:07.6987739Z 2025-05-07T20:26:07.6987745Z 2025-05-07T20:26:07.6987925Z 2025-05-07T20:26:07.6987930Z 2025-05-07T20:26:07.6987935Z 2025-05-07T20:26:07.6987940Z 2025-05-07T20:26:07.6987946Z 2025-05-07T20:26:07.6987951Z 2025-05-07T20:26:07.6988042Z 2025-05-07T20:26:07.6988048Z 2025-05-07T20:26:07.6988053Z 2025-05-07T20:26:07.6988059Z 2025-05-07T20:26:07.6988064Z 2025-05-07T20:26:07.6988070Z 2025-05-07T20:26:07.6988075Z 2025-05-07T20:26:07.6988317Z  2025-05-07T20:26:07.6988634Z 2025-05-07T20:26:07.6988640Z 2025-05-07T20:26:07.6988645Z 2025-05-07T20:26:07.6988651Z 2025-05-07T20:26:07.6988656Z 2025-05-07T20:26:07.6988661Z 2025-05-07T20:26:07.6988667Z 2025-05-07T20:26:07.6988672Z 2025-05-07T20:26:07.6988678Z 2025-05-07T20:26:07.6988683Z 2025-05-07T20:26:07.6988689Z 2025-05-07T20:26:07.6988694Z 2025-05-07T20:26:07.6988700Z 2025-05-07T20:26:07.6988705Z 2025-05-07T20:26:07.6988711Z 2025-05-07T20:26:07.6988716Z 2025-05-07T20:26:07.6988722Z 2025-05-07T20:26:07.6988978Z  2025-05-07T20:26:07.6989308Z 2025-05-07T20:26:07.6989314Z 2025-05-07T20:26:07.6989319Z 2025-05-07T20:26:07.6989324Z 2025-05-07T20:26:07.6989330Z 2025-05-07T20:26:07.6989341Z 2025-05-07T20:26:07.6989358Z 2025-05-07T20:26:07.6989364Z 2025-05-07T20:26:07.6989369Z 2025-05-07T20:26:07.6989375Z 2025-05-07T20:26:07.6989380Z 2025-05-07T20:26:07.6989386Z 2025-05-07T20:26:07.6989391Z 2025-05-07T20:26:07.6989397Z 2025-05-07T20:26:07.6989402Z 2025-05-07T20:26:07.6989408Z 2025-05-07T20:26:07.6989413Z 2025-05-07T20:26:07.6989419Z 2025-05-07T20:26:07.6989667Z  2025-05-07T20:26:07.6990001Z 2025-05-07T20:26:07.6990006Z 2025-05-07T20:26:07.6990159Z  2025-05-07T20:26:07.6990319Z 2025-05-07T20:26:07.6990333Z 2025-05-07T20:26:07.6990489Z  2025-05-07T20:26:07.6990651Z 2025-05-07T20:26:07.6990657Z 2025-05-07T20:26:07.6990673Z 2025-05-07T20:26:07.6990842Z  2025-05-07T20:26:07.6991015Z 2025-05-07T20:26:07.6991021Z 2025-05-07T20:26:07.6991034Z 2025-05-07T20:26:07.6991040Z 2025-05-07T20:26:07.6991211Z  2025-05-07T20:26:07.6991392Z 2025-05-07T20:26:07.6991398Z 2025-05-07T20:26:07.6991403Z 2025-05-07T20:26:07.6991415Z 2025-05-07T20:26:07.6991420Z 2025-05-07T20:26:07.6991587Z  2025-05-07T20:26:07.6991785Z 2025-05-07T20:26:07.6991790Z 2025-05-07T20:26:07.6991796Z 2025-05-07T20:26:07.6991801Z 2025-05-07T20:26:07.6991807Z 2025-05-07T20:26:07.6991812Z 2025-05-07T20:26:07.6991984Z  2025-05-07T20:26:07.6992190Z 2025-05-07T20:26:07.6992196Z 2025-05-07T20:26:07.6992201Z 2025-05-07T20:26:07.6992207Z 2025-05-07T20:26:07.6992212Z 2025-05-07T20:26:07.6992218Z 2025-05-07T20:26:07.6992223Z 2025-05-07T20:26:07.6992421Z  2025-05-07T20:26:07.6992637Z 2025-05-07T20:26:07.6992642Z 2025-05-07T20:26:07.6992647Z 2025-05-07T20:26:07.6992652Z 2025-05-07T20:26:07.6992657Z 2025-05-07T20:26:07.6992663Z 2025-05-07T20:26:07.6992668Z 2025-05-07T20:26:07.6992674Z 2025-05-07T20:26:07.6992871Z  2025-05-07T20:26:07.6993102Z 2025-05-07T20:26:07.6993107Z 2025-05-07T20:26:07.6993113Z 2025-05-07T20:26:07.6993118Z 2025-05-07T20:26:07.6993131Z 2025-05-07T20:26:07.6993136Z 2025-05-07T20:26:07.6993142Z 2025-05-07T20:26:07.6993148Z 2025-05-07T20:26:07.6993153Z 2025-05-07T20:26:07.6993352Z  
2025-05-07T20:26:07.6993723Z 2025-05-07T20:26:07.6993728Z 2025-05-07T20:26:07.6993734Z 2025-05-07T20:26:07.6993739Z 2025-05-07T20:26:07.6993745Z 2025-05-07T20:26:07.6993750Z 2025-05-07T20:26:07.6993755Z 2025-05-07T20:26:07.6993761Z 2025-05-07T20:26:07.6993767Z 2025-05-07T20:26:07.6993772Z 2025-05-07T20:26:07.6993990Z  2025-05-07T20:26:07.6994246Z 2025-05-07T20:26:07.6994252Z 2025-05-07T20:26:07.6994257Z 2025-05-07T20:26:07.6994263Z 2025-05-07T20:26:07.6994268Z 2025-05-07T20:26:07.6994274Z 2025-05-07T20:26:07.6994279Z 2025-05-07T20:26:07.6994285Z 2025-05-07T20:26:07.6994290Z 2025-05-07T20:26:07.6994413Z 2025-05-07T20:26:07.6994428Z 2025-05-07T20:26:07.6994656Z  2025-05-07T20:26:07.6994933Z 2025-05-07T20:26:07.6994938Z 2025-05-07T20:26:07.6995036Z 2025-05-07T20:26:07.6995043Z 2025-05-07T20:26:07.6995048Z 2025-05-07T20:26:07.6995054Z 2025-05-07T20:26:07.6995059Z 2025-05-07T20:26:07.6995064Z 2025-05-07T20:26:07.6995070Z 2025-05-07T20:26:07.6995075Z 2025-05-07T20:26:07.6995081Z 2025-05-07T20:26:07.6995086Z 2025-05-07T20:26:07.6995294Z  2025-05-07T20:26:07.6995584Z 2025-05-07T20:26:07.6995589Z 2025-05-07T20:26:07.6995595Z 2025-05-07T20:26:07.6995600Z 2025-05-07T20:26:07.6995606Z 2025-05-07T20:26:07.6995611Z 2025-05-07T20:26:07.6995617Z 2025-05-07T20:26:07.6995622Z 2025-05-07T20:26:07.6995628Z 2025-05-07T20:26:07.6995633Z 2025-05-07T20:26:07.6995639Z 2025-05-07T20:26:07.6995644Z 2025-05-07T20:26:07.6995650Z 2025-05-07T20:26:07.6995866Z  2025-05-07T20:26:07.6996151Z 2025-05-07T20:26:07.6996165Z 2025-05-07T20:26:07.6996171Z 2025-05-07T20:26:07.6996177Z 2025-05-07T20:26:07.6996182Z 2025-05-07T20:26:07.6996187Z 2025-05-07T20:26:07.6996192Z 2025-05-07T20:26:07.6996203Z 2025-05-07T20:26:07.6996209Z 2025-05-07T20:26:07.6996214Z 2025-05-07T20:26:07.6996220Z 2025-05-07T20:26:07.6996225Z 2025-05-07T20:26:07.6996231Z 2025-05-07T20:26:07.6996245Z 2025-05-07T20:26:07.6996462Z  2025-05-07T20:26:07.6996756Z 2025-05-07T20:26:07.6996762Z 2025-05-07T20:26:07.6996768Z 2025-05-07T20:26:07.6996773Z 2025-05-07T20:26:07.6996778Z 2025-05-07T20:26:07.6996784Z 2025-05-07T20:26:07.6996798Z 2025-05-07T20:26:07.6996803Z 2025-05-07T20:26:07.6996809Z 2025-05-07T20:26:07.6996814Z 2025-05-07T20:26:07.6996820Z 2025-05-07T20:26:07.6996825Z 2025-05-07T20:26:07.6996831Z 2025-05-07T20:26:07.6996836Z 2025-05-07T20:26:07.6996842Z 2025-05-07T20:26:07.6997066Z  2025-05-07T20:26:07.6997381Z 2025-05-07T20:26:07.6997394Z 2025-05-07T20:26:07.6997399Z 2025-05-07T20:26:07.6997405Z 2025-05-07T20:26:07.6997410Z 2025-05-07T20:26:07.6997416Z 2025-05-07T20:26:07.6997421Z 2025-05-07T20:26:07.6997432Z 2025-05-07T20:26:07.6997438Z 2025-05-07T20:26:07.6997443Z 2025-05-07T20:26:07.6997449Z 2025-05-07T20:26:07.6997454Z 2025-05-07T20:26:07.6997459Z 2025-05-07T20:26:07.6997465Z 2025-05-07T20:26:07.6997470Z 2025-05-07T20:26:07.6997476Z 2025-05-07T20:26:07.6997714Z  2025-05-07T20:26:07.6998025Z 2025-05-07T20:26:07.6998031Z 2025-05-07T20:26:07.6998035Z 2025-05-07T20:26:07.6998041Z 2025-05-07T20:26:07.6998046Z 2025-05-07T20:26:07.6998052Z 2025-05-07T20:26:07.6998083Z 2025-05-07T20:26:07.6998089Z 2025-05-07T20:26:07.6998094Z 2025-05-07T20:26:07.6998100Z 2025-05-07T20:26:07.6998105Z 2025-05-07T20:26:07.6998111Z 2025-05-07T20:26:07.6998116Z 2025-05-07T20:26:07.6998122Z 2025-05-07T20:26:07.6998127Z 2025-05-07T20:26:07.6998133Z 2025-05-07T20:26:07.6998145Z 2025-05-07T20:26:07.6998389Z  2025-05-07T20:26:07.6998714Z 2025-05-07T20:26:07.6998720Z 2025-05-07T20:26:07.6998725Z 2025-05-07T20:26:07.6998738Z 2025-05-07T20:26:07.6998743Z 2025-05-07T20:26:07.6998749Z 
2025-05-07T20:26:07.6998754Z 2025-05-07T20:26:07.6998760Z 2025-05-07T20:26:07.6998765Z 2025-05-07T20:26:07.6998771Z 2025-05-07T20:26:07.6998776Z 2025-05-07T20:26:07.6998782Z 2025-05-07T20:26:07.6998787Z 2025-05-07T20:26:07.6998793Z 2025-05-07T20:26:07.6998798Z 2025-05-07T20:26:07.6998804Z 2025-05-07T20:26:07.6998810Z 2025-05-07T20:26:07.6998815Z 2025-05-07T20:26:07.6999075Z  2025-05-07T20:26:07.6999397Z 2025-05-07T20:26:07.6999402Z 2025-05-07T20:26:07.6999575Z  2025-05-07T20:26:07.6999737Z 2025-05-07T20:26:07.6999743Z 2025-05-07T20:26:07.6999903Z  2025-05-07T20:26:07.7000074Z 2025-05-07T20:26:07.7000080Z 2025-05-07T20:26:07.7000086Z 2025-05-07T20:26:07.7000247Z  2025-05-07T20:26:07.7000537Z 2025-05-07T20:26:07.7000542Z 2025-05-07T20:26:07.7000557Z 2025-05-07T20:26:07.7000562Z 2025-05-07T20:26:07.7000712Z  2025-05-07T20:26:07.7000932Z 2025-05-07T20:26:07.7000937Z 2025-05-07T20:26:07.7000941Z 2025-05-07T20:26:07.7000945Z 2025-05-07T20:26:07.7000949Z 2025-05-07T20:26:07.7001079Z  2025-05-07T20:26:07.7001217Z 2025-05-07T20:26:07.7001221Z 2025-05-07T20:26:07.7001225Z 2025-05-07T20:26:07.7001229Z 2025-05-07T20:26:07.7001233Z 2025-05-07T20:26:07.7001237Z 2025-05-07T20:26:07.7001367Z  2025-05-07T20:26:07.7001513Z 2025-05-07T20:26:07.7001517Z 2025-05-07T20:26:07.7001521Z 2025-05-07T20:26:07.7001525Z 2025-05-07T20:26:07.7001529Z 2025-05-07T20:26:07.7001533Z 2025-05-07T20:26:07.7001537Z 2025-05-07T20:26:07.7001669Z  2025-05-07T20:26:07.7001827Z 2025-05-07T20:26:07.7001831Z 2025-05-07T20:26:07.7001835Z 2025-05-07T20:26:07.7001839Z 2025-05-07T20:26:07.7001843Z 2025-05-07T20:26:07.7001853Z 2025-05-07T20:26:07.7001857Z 2025-05-07T20:26:07.7001861Z 2025-05-07T20:26:07.7002024Z  2025-05-07T20:26:07.7002192Z 2025-05-07T20:26:07.7002196Z 2025-05-07T20:26:07.7002205Z 2025-05-07T20:26:07.7002209Z 2025-05-07T20:26:07.7002213Z 2025-05-07T20:26:07.7002217Z 2025-05-07T20:26:07.7002221Z 2025-05-07T20:26:07.7002231Z 2025-05-07T20:26:07.7002235Z 2025-05-07T20:26:07.7002369Z  2025-05-07T20:26:07.7002544Z 2025-05-07T20:26:07.7002548Z 2025-05-07T20:26:07.7002552Z 2025-05-07T20:26:07.7002556Z 2025-05-07T20:26:07.7002560Z 2025-05-07T20:26:07.7002564Z 2025-05-07T20:26:07.7002575Z 2025-05-07T20:26:07.7002579Z 2025-05-07T20:26:07.7002583Z 2025-05-07T20:26:07.7002587Z 2025-05-07T20:26:07.7002726Z  2025-05-07T20:26:07.7002907Z 2025-05-07T20:26:07.7002910Z 2025-05-07T20:26:07.7002914Z 2025-05-07T20:26:07.7002925Z 2025-05-07T20:26:07.7002929Z 2025-05-07T20:26:07.7002933Z 2025-05-07T20:26:07.7002937Z 2025-05-07T20:26:07.7002945Z 2025-05-07T20:26:07.7002949Z 2025-05-07T20:26:07.7002953Z 2025-05-07T20:26:07.7002957Z 2025-05-07T20:26:07.7003100Z  2025-05-07T20:26:07.7003310Z 2025-05-07T20:26:07.7003314Z 2025-05-07T20:26:07.7003318Z 2025-05-07T20:26:07.7003322Z 2025-05-07T20:26:07.7003325Z 2025-05-07T20:26:07.7003330Z 2025-05-07T20:26:07.7003334Z 2025-05-07T20:26:07.7003338Z 2025-05-07T20:26:07.7003342Z 2025-05-07T20:26:07.7003345Z 2025-05-07T20:26:07.7003349Z 2025-05-07T20:26:07.7003356Z 2025-05-07T20:26:07.7003570Z  2025-05-07T20:26:07.7003872Z 2025-05-07T20:26:07.7003878Z 2025-05-07T20:26:07.7003883Z 2025-05-07T20:26:07.7003889Z 2025-05-07T20:26:07.7003894Z 2025-05-07T20:26:07.7003900Z 2025-05-07T20:26:07.7003905Z 2025-05-07T20:26:07.7003911Z 2025-05-07T20:26:07.7003916Z 2025-05-07T20:26:07.7003922Z 2025-05-07T20:26:07.7003927Z 2025-05-07T20:26:07.7003933Z 2025-05-07T20:26:07.7003938Z 2025-05-07T20:26:07.7004186Z  done 2025-05-07T20:26:08.0163873Z Preparing transaction: | / - done 2025-05-07T20:26:12.4783168Z Verifying 
transaction: | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ done 2025-05-07T20:26:13.4115347Z Executing transaction: / - \ | / - \ | / done 2025-05-07T20:26:15.9355129Z [INSTALL] Fixing file placements for CUDA 12.8.0+ ... 2025-05-07T20:26:15.9355586Z [INSTALL] Creating symlinks: libnvToolsExt.so 2025-05-07T20:26:15.9371119Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so 2025-05-07T20:26:15.9371768Z 2025-05-07T20:26:15.9371773Z 2025-05-07T20:26:15.9372477Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so 2025-05-07T20:26:15.9373667Z 2025-05-07T20:26:15.9385661Z 2025-05-07T20:26:15.9385989Z [INSTALL] Copying nvtx3 headers ... 2025-05-07T20:26:15.9392189Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/ 2025-05-07T20:26:15.9396619Z 2025-05-07T20:26:16.1119846Z 2025-05-07T20:26:16.1125684Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/ 2025-05-07T20:26:16.1129906Z 2025-05-07T20:26:16.1149961Z 2025-05-07T20:26:16.1150480Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ... 2025-05-07T20:26:16.1546490Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ... 2025-05-07T20:26:18.0780809Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. 
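[NOTE] The fix-up above reflects that these CUDA conda packages ship the NVTX v3 headers under the nsight-compute directory, while the build presumably still expects the unversioned libnvToolsExt.so name and the headers on the usual include paths. A minimal standalone sketch of the same fix-up, assuming CONDA_PREFIX points at the build_binary environment and exactly one nsight-compute-* directory exists (both assumptions, not part of this log):
# Sketch only: mirrors the ln/cp steps above under the stated assumptions.
CONDA_PREFIX=/home/ec2-user/miniconda/envs/build_binary   # assumption
# Restore the unversioned soname by pointing it at the versioned library.
for libdir in "$CONDA_PREFIX/lib" "$CONDA_PREFIX/targets/x86_64-linux/lib"; do
    ln -sf "$libdir/libnvToolsExt.so.1" "$libdir/libnvToolsExt.so"
done
# Copy the header-only NVTX v3 files next to the rest of the CUDA headers.
nvtx3=$(echo "$CONDA_PREFIX"/nsight-compute-*/host/target-linux-x64/nvtx/include/nvtx3)
cp -r "$nvtx3"/* "$CONDA_PREFIX/include/"
cp -r "$nvtx3"/* "$CONDA_PREFIX/targets/x86_64-linux/include/"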
2025-05-07T20:26:18.1406634Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:18.5770317Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:18.6128135Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:19.0500564Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:19.0501880Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:21.5981107Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:23.7629088Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:25.8688345Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:25.8689191Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:27.9343741Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:29.9547470Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:30.0221075Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:26:34.0305636Z /tmp/tmpaxofnkh1: line 3: clang: command not found
2025-05-07T20:26:34.0306471Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:26:34.0944434Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:26:34.0964568Z total 36
2025-05-07T20:26:34.0965116Z drwxr-xr-x. 2 ec2-user ec2-user   191 May  7 20:26 .
2025-05-07T20:26:34.0965688Z drwxr-xr-x. 5 ec2-user ec2-user    62 May  7 20:24 ..
2025-05-07T20:26:34.0966269Z -rw-r--r--. 2 ec2-user ec2-user  3778 Jun 10  2024 activate-binutils_linux-64.sh
2025-05-07T20:26:34.0966827Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10  2024 activate-gcc_linux-64.sh
2025-05-07T20:26:34.0967559Z -rw-r--r--. 2 ec2-user ec2-user  5190 Jun 10  2024 activate-gxx_linux-64.sh
2025-05-07T20:26:34.0968218Z -rw-r--r--. 2 ec2-user ec2-user   136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:26:34.0968850Z -rw-r--r--. 2 ec2-user ec2-user   872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:26:34.0969341Z -rw-r--r--. 2 ec2-user ec2-user  2932 Jan 24 22:22 ~cuda-nvcc_activate.sh
2025-05-07T20:26:34.0969867Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
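[NOTE] The step above uses `conda env config vars set`, which persists a variable inside the environment itself so every later `conda run -n build_binary` sees it on activation; the earlier `printenv LD_LIBRARY_PATH` failure is expected when the variable has not been set yet, since printenv exits non-zero for unset names. A minimal sketch of the same pattern (the variable name and value are illustrative):
# Sketch only: persist an env var in the build_binary environment.
conda env config vars set -n build_binary MY_FLAG=/some/path   # MY_FLAG is illustrative
# `conda run` re-activates the environment, so the stored variable is visible:
conda run -n build_binary printenv MY_FLAG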
2025-05-07T20:26:34.0970526Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:26:34.0988336Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:26:36.1008112Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:26:36.1008957Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:26:36.5624233Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:26:38.6001790Z -allow-unsupported-compiler
2025-05-07T20:26:38.6649690Z [INFO] Printing out all preprocessor defines in nvcc ...
2025-05-07T20:26:38.6650252Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null
2025-05-07T20:26:40.6852085Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead")))
2025-05-07T20:26:40.6852872Z #define M_PIl 3.141592653589793238462643383279502884L
2025-05-07T20:26:40.6853221Z #define _IO_CURRENTLY_PUTTING 0x800
2025-05-07T20:26:40.6853558Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig))
2025-05-07T20:26:40.6853929Z #define __DBL_MIN_EXP__ (-1021)
2025-05-07T20:26:40.6854360Z #define _STL_PAIR_H 1
2025-05-07T20:26:40.6861932Z #define __cpp_attributes 200809L
2025-05-07T20:26:40.6862487Z #define __cpp_nontype_template_parameter_auto 201606L
2025-05-07T20:26:40.6862999Z #define __DELETE_THROW throw()
2025-05-07T20:26:40.6863387Z #define _PTRDIFF_T_
2025-05-07T20:26:40.6863744Z #define M_PI_4 0.78539816339744830962
2025-05-07T20:26:40.6864111Z #define __UINT_LEAST16_MAX__ 0xffff
2025-05-07T20:26:40.6864441Z #define _IO_LEFT 02
2025-05-07T20:26:40.6864787Z #define __ATOMIC_ACQUIRE 2
2025-05-07T20:26:40.6865178Z #define _POSIX2_BC_SCALE_MAX 99
2025-05-07T20:26:40.6865573Z #define _GLIBCXX_USE_RANDOM_TR1 1
2025-05-07T20:26:40.6866124Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp)
2025-05-07T20:26:40.6866589Z #define __FLT128_MAX_10_EXP__ 4932
2025-05-07T20:26:40.6867017Z #define RE_DUP_MAX (0x7fff)
2025-05-07T20:26:40.6867389Z #define _IOS_OUTPUT 2
2025-05-07T20:26:40.6867734Z #define __SM_100_RT_HPP__
2025-05-07T20:26:40.6868421Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F
2025-05-07T20:26:40.6868860Z #define toascii_l(c,l) __toascii_l ((c), (l))
2025-05-07T20:26:40.6869576Z #define __GCC_IEC_559_COMPLEX 2
2025-05-07T20:26:40.6869995Z #define _GLIBCXX_USE_FCHMOD 1
2025-05-07T20:26:40.6870398Z #define __cpp_aggregate_nsdmi 201304L
2025-05-07T20:26:40.6871498Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; }))
2025-05-07T20:26:40.6872808Z #define __UINT_LEAST8_TYPE__ unsigned char
2025-05-07T20:26:40.6873261Z #define __SIZEOF_FLOAT80__ 16
2025-05-07T20:26:40.6873784Z #define cudaTextureTypeCubemapLayered 0xFC
2025-05-07T20:26:40.6874236Z #define _T_WCHAR_
2025-05-07T20:26:40.6874586Z #define stdout stdout
2025-05-07T20:26:40.6875079Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11")))
2025-05-07T20:26:40.6875711Z #define CHAR_BIT __CHAR_BIT__
2025-05-07T20:26:40.6876134Z #define __flexarr [] 2025-05-07T20:26:40.6876533Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:26:40.6877046Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:26:40.6877574Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:26:40.6877960Z #define _MATH_H 1 2025-05-07T20:26:40.6878355Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:26:40.6878843Z #define __S64_TYPE long int 2025-05-07T20:26:40.6879212Z #define __stub_fchflags 2025-05-07T20:26:40.6879588Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:26:40.6880019Z #define __SQUAD_TYPE long int 2025-05-07T20:26:40.6880316Z #define __INTMAX_C(c) c ## L 2025-05-07T20:26:40.6880633Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:26:40.6881132Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:26:40.6881513Z #define NL_NMAX INT_MAX 2025-05-07T20:26:40.6881848Z #define _BITS_TIME_H 1 2025-05-07T20:26:40.6882181Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:26:40.6882532Z #define _GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:26:40.6882849Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:26:40.6883225Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:26:40.6883643Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:26:40.6884027Z #define __CHAR_BIT__ 8 2025-05-07T20:26:40.6884298Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:40.6884631Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:26:40.6884942Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:26:40.6885216Z #define FP_NAN 0 2025-05-07T20:26:40.6885496Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:26:40.6885933Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:26:40.6886333Z #define __cudaCDP2GetErrorString 2025-05-07T20:26:40.6887449Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 
2025-05-07T20:26:40.6888191Z 2025-05-07T20:26:40.6888297Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:26:40.6888584Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:26:40.6888850Z #define __SM_80_RT_H__ 2025-05-07T20:26:40.6889093Z #define _NEW 2025-05-07T20:26:40.6889336Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:26:40.6889627Z #define __UINT8_MAX__ 0xff 2025-05-07T20:26:40.6890017Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:26:40.6890448Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:26:40.6890699Z #define __USE_ANSI 1 2025-05-07T20:26:40.6891003Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:26:40.6891419Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:26:40.6891799Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:26:40.6892114Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:26:40.6892413Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:26:40.6892900Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:26:40.6893201Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:26:40.6893505Z #define PIPE_BUF 4096 2025-05-07T20:26:40.6893930Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:26:40.6894408Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:26:40.6894804Z #define ADJ_TICK 0x4000 2025-05-07T20:26:40.6895097Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:26:40.6895428Z #define MQ_PRIO_MAX 32768 2025-05-07T20:26:40.6895711Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:26:40.6896049Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:26:40.6896522Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:40.6897063Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:26:40.6897446Z #define _XOPEN_SOURCE 700 2025-05-07T20:26:40.6897720Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:26:40.6898012Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:26:40.6898317Z #define __cpp_static_assert 201411L 2025-05-07T20:26:40.6898619Z #define __GLIBCXX__ 20230528 2025-05-07T20:26:40.6898899Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:26:40.6899195Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:26:40.6899494Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:26:40.6899807Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:26:40.6900118Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:26:40.6900437Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:40.6900813Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:26:40.6901168Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:26:40.6901468Z #define _GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:26:40.6901797Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:40.6902169Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:26:40.6902547Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:26:40.6902867Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:26:40.6903173Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:26:40.6903521Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:26:40.6903871Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:26:40.6904290Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:26:40.6904722Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:26:40.6905044Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:26:40.6905331Z #define 
__GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:26:40.6905623Z #define __GCC_IEC_559 2 2025-05-07T20:26:40.6905938Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:26:40.6906295Z #define _IO_flockfile(_fp) 2025-05-07T20:26:40.6906568Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:26:40.6906857Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:26:40.6907138Z #define _IOFBF 0 2025-05-07T20:26:40.6907361Z #define __USE_BSD 1 2025-05-07T20:26:40.6907604Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:26:40.6907899Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:26:40.6908182Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:26:40.6908459Z #define _IO_NO_WRITES 8 2025-05-07T20:26:40.6908731Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:26:40.6909102Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:26:40.6909475Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:26:40.6909801Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:26:40.6910141Z #define __cpp_binary_literals 201304L 2025-05-07T20:26:40.6910445Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:26:40.6910732Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:26:40.6911015Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:26:40.6911338Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:26:40.6911745Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:26:40.6912138Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:26:40.6912456Z #define M_PI 3.14159265358979323846 2025-05-07T20:26:40.6912992Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:26:40.6913340Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:26:40.6913964Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:26:40.6914282Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:26:40.6914585Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:26:40.6914870Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:26:40.6915475Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? 
EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:26:40.6916089Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:26:40.6916432Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:26:40.6916767Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:26:40.6917083Z #define __cudaCDP2GetErrorName 2025-05-07T20:26:40.6917373Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:26:40.6917649Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:26:40.6917981Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:26:40.6918324Z #define __cpp_variadic_templates 200704L 2025-05-07T20:26:40.6918642Z #define RAND_MAX 2147483647 2025-05-07T20:26:40.6918918Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:26:40.6919261Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:40.6919594Z #define __SM_90_RT_H__ 2025-05-07T20:26:40.6919848Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:26:40.6920123Z #define __COMPAR_FN_T 2025-05-07T20:26:40.6920381Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:26:40.6920654Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:26:40.6921156Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:26:40.6921692Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:26:40.6922048Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:26:40.6922426Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:26:40.6922743Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:26:40.6923107Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:26:40.6923432Z #define __cpp_variable_templates 201304L 2025-05-07T20:26:40.6924439Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:40.6925015Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:26:40.6925360Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:26:40.6925692Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:26:40.6926003Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:26:40.6926323Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:26:40.6926605Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:26:40.6926882Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:26:40.6927162Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:26:40.6927423Z #define __u_char_defined 2025-05-07T20:26:40.6927754Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:26:40.6928131Z #define STA_PPSERROR 0x0800 2025-05-07T20:26:40.6928403Z #define _GLIBCXX_STD_A std 2025-05-07T20:26:40.6928678Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:26:40.6928970Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:26:40.6929438Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:26:40.6929885Z #define FP_INFINITE 1 2025-05-07T20:26:40.6930272Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:40.6930710Z #define _IO_pid_t __pid_t 2025-05-07T20:26:40.6930980Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:26:40.6931246Z #define __LEAF , __leaf__ 2025-05-07T20:26:40.6931501Z #define PATH_MAX 4096 2025-05-07T20:26:40.6931767Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:26:40.6932113Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:26:40.6932448Z #define _LIMITS_H___ 2025-05-07T20:26:40.6932689Z #define __size_t 2025-05-07T20:26:40.6932924Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:26:40.6933488Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | 
STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:26:40.6934367Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:26:40.6934861Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:26:40.6935208Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:26:40.6935484Z #define _WCHAR_T_DEFINED 2025-05-07T20:26:40.6935862Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:26:40.6936273Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:26:40.6936585Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:26:40.6936931Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:26:40.6937225Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:26:40.6937526Z #define __INT8_C(c) c 2025-05-07T20:26:40.6937803Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:26:40.6938119Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:26:40.6938391Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:26:40.6938664Z #define __SM_70_RT_HPP__ 2025-05-07T20:26:40.6938930Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:26:40.6939218Z #define __cpp_variadic_using 201611L 2025-05-07T20:26:40.6939559Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:40.6939910Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:26:40.6940191Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:26:40.6940480Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:26:40.6940758Z #define __cpp_capture_star_this 201603L 2025-05-07T20:26:40.6941082Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:26:40.6941403Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:26:40.6941787Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:26:40.6942184Z #define NFDBITS __NFDBITS 2025-05-07T20:26:40.6942457Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:26:40.6942764Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:26:40.6943102Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:26:40.6943430Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:26:40.6943702Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:26:40.6944013Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:26:40.6944330Z #define STA_UNSYNC 0x0040 2025-05-07T20:26:40.6944668Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:40.6945103Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:26:40.6945478Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:26:40.6945781Z #define __cpp_if_constexpr 201606L 2025-05-07T20:26:40.6946116Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:26:40.6946453Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:26:40.6946795Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:26:40.6947150Z #define __daddr_t_defined 2025-05-07T20:26:40.6947419Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:26:40.6947703Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:26:40.6948042Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:26:40.6948578Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:26:40.6949083Z #define _ACRTIMP 2025-05-07T20:26:40.6949320Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:26:40.6949609Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:26:40.6949911Z #define _IOS_BIN 128 2025-05-07T20:26:40.6950285Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:26:40.6950717Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:26:40.6951003Z #define UNDERFLOW 4 2025-05-07T20:26:40.6951228Z #define NAME_MAX 255 
2025-05-07T20:26:40.6951478Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:26:40.6951763Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:26:40.6952052Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:26:40.6952363Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:26:40.6952758Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:26:40.6953157Z #define __ptr_t void * 2025-05-07T20:26:40.6953406Z #define M_E 2.7182818284590452354 2025-05-07T20:26:40.6953973Z #define cudaSurfaceType1D 0x01 2025-05-07T20:26:40.6954247Z #define __USE_ISOCXX11 1 2025-05-07T20:26:40.6954527Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:26:40.6955014Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:26:40.6955324Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:26:40.6955615Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:26:40.6955919Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:26:40.6956252Z #define cudaSurfaceType2D 0x02 2025-05-07T20:26:40.6956520Z #define __linux 1 2025-05-07T20:26:40.6956761Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:26:40.6957054Z #define cudaDeviceMask 0xff 2025-05-07T20:26:40.6957333Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:26:40.6957643Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:26:40.6957938Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:26:40.6958236Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:26:40.6958559Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:26:40.6958883Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:26:40.6959192Z #define _BITS_TYPES_H 1 2025-05-07T20:26:40.6959498Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:26:40.6959868Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:26:40.6960179Z #define cudaSurfaceType3D 0x03 2025-05-07T20:26:40.6960478Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:26:40.6960783Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:26:40.6961091Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:26:40.6961897Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:26:40.6962744Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:26:40.6963048Z #define __unix 1 2025-05-07T20:26:40.6963273Z #define MATH_ERRNO 1 2025-05-07T20:26:40.6963537Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:26:40.6963833Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:26:40.6964107Z #define __SM_100_RT_H__ 2025-05-07T20:26:40.6964372Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:26:40.6964675Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:26:40.6964984Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:26:40.6965267Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:26:40.6965584Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:26:40.6966069Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:26:40.6966548Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:26:40.6966861Z #define CUDARTAPI_CDECL 2025-05-07T20:26:40.6967133Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:26:40.6967415Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:26:40.6967717Z #define __cpp_lib_void_t 201411 2025-05-07T20:26:40.6967995Z #define _POSIX_AIO_MAX 1 2025-05-07T20:26:40.6968239Z #define __SIZE_T 2025-05-07T20:26:40.6968503Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:26:40.6968840Z #define 
_GLIBCXX_FULLY_DYNAMIC_STRING 0 2025-05-07T20:26:40.6969149Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:26:40.6969426Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:26:40.6969711Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:26:40.6969989Z #define _ATFILE_SOURCE 1 2025-05-07T20:26:40.6970391Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:26:40.6970840Z #define __WAIT_STATUS void * 2025-05-07T20:26:40.6971121Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:26:40.6971398Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:26:40.6971683Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:26:40.6971988Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:26:40.6972273Z #define __WINT_MIN__ 0U 2025-05-07T20:26:40.6972876Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:26:40.6973547Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:26:40.6973864Z #define WUNTRACED 2 2025-05-07T20:26:40.6974206Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:26:40.6974501Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:26:40.6974802Z #define NZERO 20 2025-05-07T20:26:40.6975117Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:26:40.6975416Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:26:40.6975755Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:26:40.6976079Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:26:40.6976355Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:26:40.6976656Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:26:40.6976940Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:26:40.6977239Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:26:40.6977529Z #define EXIT_FAILURE 1 2025-05-07T20:26:40.6977781Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:26:40.6978055Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:26:40.6978340Z #define _SIZE_T_DEFINED_ 2025-05-07T20:26:40.6978606Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:26:40.6978902Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:26:40.6979266Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:26:40.6979642Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:26:40.6979947Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:26:40.6980212Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:26:40.6980500Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:26:40.6980804Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:26:40.6981126Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:26:40.6981434Z #define SEEK_DATA 3 2025-05-07T20:26:40.6981672Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:26:40.6981982Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:26:40.6982420Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:26:40.6982823Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:26:40.6983096Z #define __INT64_C(c) c ## L 2025-05-07T20:26:40.6983380Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:26:40.6983734Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:26:40.6984076Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:26:40.6984370Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:26:40.6984692Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:26:40.6985004Z #define STA_PPSWANDER 0x0400 2025-05-07T20:26:40.6985276Z #define __INT_WCHAR_T_H 2025-05-07T20:26:40.6985530Z #define WSTOPPED 2 2025-05-07T20:26:40.6985775Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:26:40.6986079Z #define 
_POSIX_MQ_OPEN_MAX 8 2025-05-07T20:26:40.6986346Z #define FP_NORMAL 4 2025-05-07T20:26:40.6986596Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:26:40.6986898Z #define _BITS_TIMEX_H 1 2025-05-07T20:26:40.6987151Z #define _POSIX_LINK_MAX 8 2025-05-07T20:26:40.6987418Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:26:40.6987735Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:26:40.6988197Z #define cudaTextureType1D 0x01 2025-05-07T20:26:40.6999330Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:26:40.6999674Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:26:40.6999985Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:26:40.7000300Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:26:40.7000750Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:26:40.7001221Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:26:40.7001509Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:26:40.7001785Z #define _POSIX_SOURCE 1 2025-05-07T20:26:40.7002058Z #define cudaTextureType2D 0x02 2025-05-07T20:26:40.7002339Z #define _PTR_TRAITS_H 1 2025-05-07T20:26:40.7002617Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:26:40.7002942Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:26:40.7003221Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:26:40.7003548Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:26:40.7003896Z #define cudaTextureType3D 0x03 2025-05-07T20:26:40.7004184Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:26:40.7004464Z #define CLOCK_REALTIME 0 2025-05-07T20:26:40.7004727Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:26:40.7005238Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:26:40.7005548Z #define __cpp_aligned_new 201606L 2025-05-07T20:26:40.7005926Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:26:40.7006215Z #define cudaEventBlockingSync 0x01 2025-05-07T20:26:40.7006525Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:26:40.7006811Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:26:40.7007138Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:26:40.7007459Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:26:40.7007750Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:26:40.7008025Z #define __GLIBC__ 2 2025-05-07T20:26:40.7008264Z #define __END_DECLS } 2025-05-07T20:26:40.7008520Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:26:40.7008902Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:26:40.7009302Z #define __CONCAT(x,y) x ## y 2025-05-07T20:26:40.7009567Z #define WCONTINUED 8 2025-05-07T20:26:40.7009807Z #define __STDC_HOSTED__ 1 2025-05-07T20:26:40.7010090Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:26:40.7010369Z #define _ALLOCA_H 1 2025-05-07T20:26:40.7010613Z #define __host__ __location__(host) 2025-05-07T20:26:40.7011067Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:26:40.7011520Z #define __SLONG32_TYPE int 2025-05-07T20:26:40.7011798Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:26:40.7012100Z #define _SYS_SELECT_H 1 2025-05-07T20:26:40.7012352Z #define _IO_LINE_BUF 0x200 2025-05-07T20:26:40.7012612Z #define _IOS_NOCREATE 32 2025-05-07T20:26:40.7012876Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:26:40.7013165Z #define __cudaGet_warpSize() warpSize 2025-05-07T20:26:40.7013473Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:26:40.7013776Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:26:40.7014071Z #define __global__ __location__(global) 2025-05-07T20:26:40.7014374Z #define 
__GNU_LIBRARY__ 6 2025-05-07T20:26:40.7014646Z #define __cpp_decltype_auto 201304L 2025-05-07T20:26:40.7014940Z #define __DBL_DIG__ 15 2025-05-07T20:26:40.7015176Z #define TIME_UTC 1 2025-05-07T20:26:40.7015404Z #define __FLT32_DIG__ 6 2025-05-07T20:26:40.7015752Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:26:40.7016161Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:26:40.7016495Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:26:40.7016824Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:26:40.7017133Z #define _G_BUFSIZ 8192 2025-05-07T20:26:40.7017451Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:26:40.7017839Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:26:40.7018143Z #define __cudaCDP2GetDevice 2025-05-07T20:26:40.7018439Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:26:40.7018744Z #define STA_CLOCKERR 0x1000 2025-05-07T20:26:40.7019001Z #define __GXX_WEAK__ 1 2025-05-07T20:26:40.7019269Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:40.7019597Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:26:40.7019874Z #define __SHRT_WIDTH__ 16 2025-05-07T20:26:40.7020183Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:26:40.7020544Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:26:40.7020839Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:26:40.7021135Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:26:40.7021448Z #define _G_config_h 1 2025-05-07T20:26:40.7021743Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:26:40.7022091Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:26:40.7022387Z #define _GCC_WCHAR_T 2025-05-07T20:26:40.7022633Z #define TMP_MAX 238328 2025-05-07T20:26:40.7022881Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:26:40.7023165Z #define __DEVICE_TYPES_H__ 2025-05-07T20:26:40.7023442Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:40.7023726Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:26:40.7024628Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:26:40.7024932Z #define _IO_SKIPWS 01 2025-05-07T20:26:40.7025655Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:26:40.7026179Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:26:40.7026608Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:26:40.7026967Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:26:40.7027346Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:26:40.7027731Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:26:40.7028115Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:26:40.7028378Z #define le32toh(x) (x) 2025-05-07T20:26:40.7028627Z #define _SIZE_T_DEFINED 2025-05-07T20:26:40.7028896Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:26:40.7029245Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:26:40.7029613Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:26:40.7030028Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:26:40.7030462Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:26:40.7030747Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:26:40.7031027Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:26:40.7031307Z #define _POSIX_NAME_MAX 14 2025-05-07T20:26:40.7031601Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:26:40.7032153Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:26:40.7032683Z #define 
_GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:26:40.7033005Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:26:40.7033370Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:26:40.7033786Z #define _WCHAR_T_ 2025-05-07T20:26:40.7034029Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:26:40.7034408Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:26:40.7034813Z #define RTSIG_MAX 32 2025-05-07T20:26:40.7035051Z #define _STDDEF_H 2025-05-07T20:26:40.7035291Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:26:40.7035604Z #define _VA_LIST_DEFINED 2025-05-07T20:26:40.7035899Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:26:40.7036247Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:26:40.7036667Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:26:40.7037019Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:26:40.7037322Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:26:40.7037804Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:26:40.7038358Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:26:40.7038748Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:26:40.7039084Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:26:40.7039417Z #define __unix__ 1 2025-05-07T20:26:40.7039668Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:40.7039960Z #define __INT_WIDTH__ 32 2025-05-07T20:26:40.7040223Z #define __SIZEOF_LONG__ 8 2025-05-07T20:26:40.7040478Z #define _IONBF 2 2025-05-07T20:26:40.7040937Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:26:40.7041746Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
__uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++) 2025-05-07T20:26:40.7042304Z #define __STDC_IEC_559__ 1 2025-05-07T20:26:40.7042575Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:26:40.7042857Z #define __UINT16_C(c) c 2025-05-07T20:26:40.7043119Z #define M_2_PI 0.63661977236758134308 2025-05-07T20:26:40.7043407Z #define STA_DEL 0x0020 2025-05-07T20:26:40.7043658Z #define __CUDACC_VER_MINOR__ 8 2025-05-07T20:26:40.7043934Z #define __id_t_defined 2025-05-07T20:26:40.7044222Z #define w_retcode __wait_terminated.__w_retcode 2025-05-07T20:26:40.7044691Z #define _IO_PENDING_OUTPUT_COUNT(_fp) ((_fp)->_IO_write_ptr - (_fp)->_IO_write_base) 2025-05-07T20:26:40.7045140Z #define _GLIBCXX_HAVE_MODFF 1 2025-05-07T20:26:40.7045422Z #define _GLIBCXX_HAVE_MODFL 1 2025-05-07T20:26:40.7045692Z #define __DECIMAL_DIG__ 21 2025-05-07T20:26:40.7046105Z #define _POSIX2_RE_DUP_MAX 255 2025-05-07T20:26:40.7046387Z #define __USE_FORTIFY_LEVEL 0 2025-05-07T20:26:40.7046754Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:26:40.7047038Z #define SING 2 2025-05-07T20:26:40.7047271Z #define STA_FREQHOLD 0x0080 2025-05-07T20:26:40.7047556Z #define __SM_32_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:40.7047872Z #define cudaStreamDefault 0x00 2025-05-07T20:26:40.7048240Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:26:40.7048634Z #define _GLIBCXX_HAVE_HYPOTL 1 2025-05-07T20:26:40.7048919Z #define _GLIBCXX_HAVE_SYS_UIO_H 1 2025-05-07T20:26:40.7049205Z #define __gnu_linux__ 1 2025-05-07T20:26:40.7049458Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:26:40.7049725Z #define _LARGEFILE_SOURCE 1 2025-05-07T20:26:40.7050046Z #define MAX_INPUT 255 2025-05-07T20:26:40.7050310Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:26:40.7050650Z #define __isalpha_l(c,l) __isctype_l((c), _ISalpha, (l)) 2025-05-07T20:26:40.7051050Z #define __glibcxx_requires_heap(_First,_Last) 2025-05-07T20:26:40.7051385Z #define _GLIBCXX_CPU_DEFINES 1 2025-05-07T20:26:40.7051674Z #define _GLIBCXX_HAVE_POLL_H 1 2025-05-07T20:26:40.7052087Z #define __attribute_warn_unused_result__ __attribute__ ((__warn_unused_result__)) 2025-05-07T20:26:40.7052542Z #define _IO_SHOWPOS 02000 2025-05-07T20:26:40.7052892Z #define _GLIBCXX_HAVE_SYMVER_SYMBOL_RENAMING_RUNTIME_SUPPORT 1 2025-05-07T20:26:40.7053266Z #define _Mfloat_ float 2025-05-07T20:26:40.7053544Z #define __glibcxx_requires_cond(_Cond,_Msg) 2025-05-07T20:26:40.7053873Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:26:40.7054170Z #define DELAYTIMER_MAX 2147483647 2025-05-07T20:26:40.7054514Z #define cudaMemPoolCreateUsageHwDecompress 0x2 2025-05-07T20:26:40.7055083Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ? 
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:26:40.7055599Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:40.7055893Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:26:40.7056239Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:26:40.7056619Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:26:40.7056928Z #define __USE_ISOC11 1 2025-05-07T20:26:40.7057175Z #define _BSD_SIZE_T_ 2025-05-07T20:26:40.7057425Z #define ADJ_MICRO 0x1000 2025-05-07T20:26:40.7057708Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:26:40.7057990Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:26:40.7058303Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:26:40.7058646Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:26:40.7058976Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:26:40.7059318Z #define __THROW throw () 2025-05-07T20:26:40.7059589Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:26:40.7059899Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:40.7060267Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:26:40.7060641Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:26:40.7060942Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:26:40.7061225Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:26:40.7061509Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:26:40.7061790Z #define L_tmpnam 20 2025-05-07T20:26:40.7062027Z #define ___int_wchar_t_h 2025-05-07T20:26:40.7062385Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:26:40.7062787Z #define isascii(c) __isascii (c) 2025-05-07T20:26:40.7063064Z #define _T_PTRDIFF 2025-05-07T20:26:40.7063382Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:26:40.7063758Z #define toascii(c) __toascii (c) 2025-05-07T20:26:40.7064030Z #define __GNUC__ 11 2025-05-07T20:26:40.7064290Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:26:40.7064606Z #define __GXX_RTTI 1 2025-05-07T20:26:40.7064842Z #define __pie__ 2 2025-05-07T20:26:40.7065059Z #define __MMX__ 1 2025-05-07T20:26:40.7065295Z #define __cudaCDP2Malloc 2025-05-07T20:26:40.7065566Z #define __timespec_defined 1 2025-05-07T20:26:40.7065963Z #define L_ctermid 9 2025-05-07T20:26:40.7066203Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:40.7066654Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:26:40.7067075Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:26:40.7067461Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:26:40.7067748Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:26:40.7068054Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:26:40.7068373Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:26:40.7068707Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:26:40.7068986Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:26:40.7069446Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:26:40.7070223Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:40.7070854Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:26:40.7071183Z #define __USE_SVID 1 2025-05-07T20:26:40.7071447Z #define __constant__ __location__(constant) 2025-05-07T20:26:40.7071784Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:26:40.7072099Z #define __device__ __location__(device) 2025-05-07T20:26:40.7072437Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:26:40.7072778Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:26:40.7073060Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:26:40.7073351Z #define CUDART_DEVICE __device__ 2025-05-07T20:26:40.7073876Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:26:40.7074266Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:26:40.7074566Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:26:40.7074952Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:26:40.7075351Z #define __STDC_UTF_16__ 1 2025-05-07T20:26:40.7075618Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:26:40.7076000Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:26:40.7076457Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:26:40.7076792Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:26:40.7077080Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:26:40.7077363Z #define NGROUPS_MAX 65536 2025-05-07T20:26:40.7077638Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:26:40.7077911Z #define __USE_ISOC95 1 2025-05-07T20:26:40.7078153Z #define _TIME_H 1 2025-05-07T20:26:40.7078438Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:26:40.7078769Z #define __USE_ISOC99 1 2025-05-07T20:26:40.7079110Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:26:40.7079499Z #define HOST_NAME_MAX 64 2025-05-07T20:26:40.7079770Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:26:40.7080040Z #define _IOS_ATEND 4 2025-05-07T20:26:40.7080290Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:26:40.7080633Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:26:40.7081057Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:40.7081420Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:26:40.7081725Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:26:40.7082060Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:26:40.7082395Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:26:40.7082667Z #define _STDIO_H 1 2025-05-07T20:26:40.7083075Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:26:40.7083568Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:26:40.7083948Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:40.7084342Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:26:40.7084644Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:26:40.7084930Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:26:40.7085218Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:26:40.7085521Z #define __cpp_raw_strings 200710L 2025-05-07T20:26:40.7085843Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:40.7086303Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:26:40.7086586Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:26:40.7086961Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:26:40.7087284Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:26:40.7087569Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:26:40.7087877Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:26:40.7088250Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:26:40.7088636Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:26:40.7088893Z #define __USE_XOPEN 1 2025-05-07T20:26:40.7089153Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:26:40.7089616Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:40.7090072Z #define __USE_XOPEN2K 1 2025-05-07T20:26:40.7090335Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:26:40.7090622Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:26:40.7090929Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:26:40.7091230Z #define __cpp_fold_expressions 201603L 2025-05-07T20:26:40.7091784Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:26:40.7092329Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:26:40.7092631Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:26:40.7093011Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:26:40.7093421Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:26:40.7093815Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:26:40.7094229Z #define __END_NAMESPACE_C99 2025-05-07T20:26:40.7094520Z #define __glibcxx_integral_traps true 2025-05-07T20:26:40.7094820Z #define _POSIX_PATH_MAX 256 2025-05-07T20:26:40.7095094Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:26:40.7095373Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:26:40.7095648Z #define _IOS_TRUNC 16 2025-05-07T20:26:40.7095892Z #define _ISOC11_SOURCE 1 2025-05-07T20:26:40.7096160Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:26:40.7096468Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:26:40.7096783Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:26:40.7097177Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:26:40.7097578Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:26:40.7097865Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:26:40.7098144Z #define _IO_UNITBUF 020000 2025-05-07T20:26:40.7098418Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:26:40.7098693Z #define __FD_SETSIZE 1024 2025-05-07T20:26:40.7098962Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:26:40.7099251Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:26:40.7099604Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:26:40.7099979Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:26:40.7100263Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:26:40.7100584Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:26:40.7100925Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:26:40.7101221Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:26:40.7101537Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:26:40.7101899Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:26:40.7102203Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:26:40.7102546Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:26:40.7102845Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:26:40.7103134Z #define __USE_POSIX199506 1 2025-05-07T20:26:40.7103400Z #define _FEATURES_H 1 2025-05-07T20:26:40.7103652Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:26:40.7104069Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:26:40.7104571Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:26:40.7104910Z #define 
__stub_getmsg 2025-05-07T20:26:40.7105163Z #define _IO_FIXED 010000 2025-05-07T20:26:40.7105456Z #define __cpp_lib_addressof_constexpr 201603 2025-05-07T20:26:40.7105782Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:26:40.7106196Z #define __stub_setlogin 2025-05-07T20:26:40.7106454Z #define __stub_fattach 2025-05-07T20:26:40.7106710Z #define __cplusplus 201703L 2025-05-07T20:26:40.7107067Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:26:40.7107375Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:26:40.7107648Z #define INFINITY (__builtin_inff()) 2025-05-07T20:26:40.7107941Z #define _IO_UNBUFFERED 2 2025-05-07T20:26:40.7108448Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:26:40.7109001Z #define _IO_INTERNAL 010 2025-05-07T20:26:40.7109260Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:26:40.7109618Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:40.7109990Z #define __dev_t_defined 2025-05-07T20:26:40.7110242Z #define __DEPRECATED 1 2025-05-07T20:26:40.7110491Z #define __S32_TYPE int 2025-05-07T20:26:40.7110763Z #define __cpp_rvalue_references 200610L 2025-05-07T20:26:40.7111070Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:26:40.7111349Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:26:40.7111627Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:26:40.7112264Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:26:40.7112916Z #define _G_HAVE_MREMAP 1 2025-05-07T20:26:40.7113246Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:40.7113740Z #define OVERFLOW 3 2025-05-07T20:26:40.7113998Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:26:40.7114324Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:26:40.7114627Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:40.7114979Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:26:40.7115329Z #define __SSE2_MATH__ 1 2025-05-07T20:26:40.7115590Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:26:40.7115912Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:40.7116232Z #define _IO_STDIO_H 2025-05-07T20:26:40.7116499Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:26:40.7116815Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:26:40.7117147Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:26:40.7117470Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:40.7117797Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:26:40.7118075Z #define __amd64 1 2025-05-07T20:26:40.7118313Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:26:40.7118598Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:26:40.7118887Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:26:40.7119194Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:26:40.7119522Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:26:40.7119802Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:26:40.7120115Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:26:40.7120395Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:26:40.7120657Z #define __bounded 2025-05-07T20:26:40.7120908Z #define _GLIBCXX_HAVE_ACOSL 1 2025-05-07T20:26:40.7121313Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:26:40.7121727Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:26:40.7122423Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:26:40.7122763Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:26:40.7123168Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:40.7123674Z #define __W_STOPCODE(sig) ((sig) 
<< 8 | 0x7f) 2025-05-07T20:26:40.7124646Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:40.7125221Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:26:40.7125689Z #define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:26:40.7126177Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:26:40.7126592Z #define STA_PLL 0x0001 2025-05-07T20:26:40.7127016Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:26:40.7127418Z #define __GNUG__ 11 2025-05-07T20:26:40.7136509Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:26:40.7136845Z #define _T_WCHAR 2025-05-07T20:26:40.7137107Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:26:40.7137429Z #define __specialization_static 2025-05-07T20:26:40.7137753Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:26:40.7138429Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:26:40.7138715Z #define cudaArraySparse 0x40 2025-05-07T20:26:40.7139129Z #define STA_PPSFREQ 0x0002 2025-05-07T20:26:40.7139436Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:26:40.7139757Z #define _WCHAR_T 2025-05-07T20:26:40.7139990Z #define __cudaCDP2Free 2025-05-07T20:26:40.7140654Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:26:40.7141362Z #define __cpp_nsdmi 200809L 2025-05-07T20:26:40.7141805Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:26:40.7142267Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:26:40.7142568Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:26:40.7142852Z #define cudaArrayCubemap 0x04 2025-05-07T20:26:40.7143208Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:40.7143580Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:26:40.7143841Z #define __NO_CTYPE 1 2025-05-07T20:26:40.7144085Z #define __stub_bdflush 2025-05-07T20:26:40.7144479Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:26:40.7144928Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:26:40.7145254Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:26:40.7145536Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:26:40.7145833Z #define __cpp_initializer_lists 200806L 2025-05-07T20:26:40.7146160Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:26:40.7146469Z #define __U16_TYPE unsigned short int 2025-05-07T20:26:40.7146831Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:26:40.7147201Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:26:40.7147495Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:26:40.7147795Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:26:40.7148162Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:26:40.7148519Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:26:40.7148812Z #define _IO_STDIO 040000 2025-05-07T20:26:40.7149164Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:26:40.7149567Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:26:40.7149905Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:26:40.7150206Z #define _PTRDIFF_T 2025-05-07T20:26:40.7150438Z #define _MOVE_H 1 2025-05-07T20:26:40.7150680Z #define __cpp_hex_float 201603L 2025-05-07T20:26:40.7150951Z #define ADJ_TAI 0x0080 2025-05-07T20:26:40.7151197Z #define __ptrvalue 2025-05-07T20:26:40.7151434Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:26:40.7151693Z 
#define __GXX_ABI_VERSION 1016 2025-05-07T20:26:40.7151993Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:26:40.7152310Z #define MATH_ERREXCEPT 2 2025-05-07T20:26:40.7152572Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:26:40.7152875Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:26:40.7153296Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:26:40.7153833Z #define __USE_GNU 1 2025-05-07T20:26:40.7154082Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:26:40.7154377Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:26:40.7154661Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:26:40.7155063Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:26:40.7155469Z #define WEXITED 4 2025-05-07T20:26:40.7155699Z #define _IO_NO_READS 4 2025-05-07T20:26:40.7156012Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:26:40.7156402Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:26:40.7156722Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:26:40.7157033Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:26:40.7157363Z #define __uid_t_defined 2025-05-07T20:26:40.7157631Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:26:40.7157929Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:26:40.7158358Z #define WNOHANG 1 2025-05-07T20:26:40.7158618Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:26:40.7159028Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:26:40.7159318Z #define cudaEventDefault 0x00 2025-05-07T20:26:40.7159636Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:26:40.7159972Z #define NL_SETMAX INT_MAX 2025-05-07T20:26:40.7160222Z #define __x86_64 1 2025-05-07T20:26:40.7160471Z #define __cudaCDP2LaunchDevice 2025-05-07T20:26:40.7160888Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:40.7161383Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:26:40.7161904Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:40.7162357Z #define __PTRDIFF_T 2025-05-07T20:26:40.7162696Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:26:40.7163094Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:26:40.7163396Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:40.7163706Z #define _Mlong_double_ long double 2025-05-07T20:26:40.7164006Z #define __cpp_lambdas 200907L 2025-05-07T20:26:40.7164276Z #define _IO_DEC 020 2025-05-07T20:26:40.7164517Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:26:40.7164800Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:26:40.7165107Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:26:40.7165406Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:26:40.7165678Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:26:40.7165992Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:26:40.7166364Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:26:40.7166671Z #define _ANSI_STDDEF_H 2025-05-07T20:26:40.7166954Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:26:40.7167291Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:26:40.7167670Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:26:40.7168076Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:26:40.7168380Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:26:40.7168692Z #define __cpp_template_auto 201606L 2025-05-07T20:26:40.7169067Z #define __DBL_MIN__ 
double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:26:40.7169456Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:26:40.7169742Z #define __key_t_defined 2025-05-07T20:26:40.7170003Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:26:40.7170392Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:26:40.7170884Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:26:40.7171268Z #define __GNUC_VA_LIST 2025-05-07T20:26:40.7171619Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:40.7172030Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:26:40.7172310Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:26:40.7172601Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:26:40.7172911Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:26:40.7173183Z #define __WCOREFLAG 0x80 2025-05-07T20:26:40.7173449Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:26:40.7173778Z #define cudaEventDisableTiming 0x02 2025-05-07T20:26:40.7174075Z #define __LP64__ 1 2025-05-07T20:26:40.7174331Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:26:40.7174665Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:26:40.7174969Z #define _IO_off64_t __off64_t 2025-05-07T20:26:40.7175240Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:40.7175518Z #define __time_t_defined 1 2025-05-07T20:26:40.7175787Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:26:40.7176148Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:26:40.7176532Z #define __USE_UNIX98 1 2025-05-07T20:26:40.7176790Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:26:40.7177080Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:26:40.7177360Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:26:40.7177680Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:26:40.7178131Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:26:40.7178401Z #define SEEK_CUR 1 2025-05-07T20:26:40.7178768Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:40.7179058Z #define _ASSERT_H 1 2025-05-07T20:26:40.7179648Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:26:40.7180306Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:26:40.7180601Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:26:40.7180867Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:26:40.7181153Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:26:40.7181446Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:26:40.7181841Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:40.7182267Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:26:40.7182953Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:26:40.7183639Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:26:40.7183950Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:26:40.7184324Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:26:40.7184722Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:26:40.7185010Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:26:40.7185305Z #define cudaArrayDefault 0x00 2025-05-07T20:26:40.7185603Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:26:40.7185914Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:26:40.7186238Z #define TLOSS 5 2025-05-07T20:26:40.7186503Z #define __ssize_t_defined 2025-05-07T20:26:40.7186793Z #define __CUDACC_VER_BUILD__ 61 2025-05-07T20:26:40.7187082Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:26:40.7187397Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:26:40.7187696Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:26:40.7187999Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:26:40.7188315Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:26:40.7188644Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:26:40.7188956Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:26:40.7189269Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:26:40.7189576Z #define __REGISTER_PREFIX__ 2025-05-07T20:26:40.7189852Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:26:40.7190216Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:26:40.7190600Z #define _IOS_NOREPLACE 64 2025-05-07T20:26:40.7190856Z #define __cdecl 2025-05-07T20:26:40.7191104Z #define cudaEventInterprocess 0x04 2025-05-07T20:26:40.7191453Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:26:40.7191802Z #define LOGIN_NAME_MAX 256 2025-05-07T20:26:40.7192071Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:26:40.7192360Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:26:40.7192679Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:26:40.7192968Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:26:40.7193302Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:26:40.7193788Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:26:40.7194224Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:40.7194682Z #define ADJ_NANO 0x2000 2025-05-07T20:26:40.7195012Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:26:40.7195409Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:26:40.7195721Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:26:40.7196020Z #define __FLT_DIG__ 6 2025-05-07T20:26:40.7196398Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:26:40.7196818Z #define __NO_INLINE__ 1 2025-05-07T20:26:40.7197141Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:40.7197508Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:26:40.7197786Z #define ADJ_STATUS 0x0010 2025-05-07T20:26:40.7198071Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:26:40.7198471Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:26:40.7198764Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:40.7199161Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:26:40.7199467Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:26:40.7199876Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 2025-05-07T20:26:40.7200315Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 2025-05-07T20:26:40.7200683Z 
#define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:26:40.7201043Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:26:40.7201308Z #define MAX_CANON 255 2025-05-07T20:26:40.7201560Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:26:40.7201826Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:26:40.7202111Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:26:40.7202414Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:26:40.7202736Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:26:40.7203055Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:26:40.7203358Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:26:40.7203693Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:26:40.7204029Z #define __VERSION__ "11.4.0" 2025-05-07T20:26:40.7204314Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:26:40.7204626Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:26:40.7204929Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:26:40.7205229Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:26:40.7205566Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:26:40.7205872Z #define __UINT64_C(c) c ## UL 2025-05-07T20:26:40.7206153Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:26:40.7206425Z #define _SYS_TYPES_H 1 2025-05-07T20:26:40.7206678Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:26:40.7206957Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:26:40.7207226Z #define _SYS_CDEFS_H 1 2025-05-07T20:26:40.7207472Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:26:40.7207765Z #define __cpp_unicode_characters 201411L 2025-05-07T20:26:40.7208083Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:26:40.7208350Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:26:40.7208663Z #define __cudaCDP2StreamDestroy 2025-05-07T20:26:40.7208956Z #define FP_SUBNORMAL 3 2025-05-07T20:26:40.7209220Z #define cudaOccupancyDefault 0x00 2025-05-07T20:26:40.7209519Z #define _INITIALIZER_LIST 2025-05-07T20:26:40.7209784Z #define _STDC_PREDEF_H 1 2025-05-07T20:26:40.7210066Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:26:40.7210368Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:26:40.7210646Z #define _IO_file_flags _flags 2025-05-07T20:26:40.7210926Z #define __USE_XOPEN2K8 1 2025-05-07T20:26:40.7211188Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:26:40.7211489Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:26:40.7211788Z #define HUGE 3.40282347e+38F 2025-05-07T20:26:40.7212069Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:26:40.7212477Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:26:40.7212895Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:26:40.7213221Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:26:40.7213511Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:26:40.7213793Z #define _BSD_SOURCE 1 2025-05-07T20:26:40.7214041Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:26:40.7214917Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template<typename _Tp, typename = __void_t<>> struct __has_ ##_NTYPE : false_type { }; template<typename _Tp> struct __has_ ##_NTYPE<_Tp, __void_t<typename _Tp::_NTYPE>> : true_type { }; 2025-05-07T20:26:40.7215808Z #define __catch(X) catch(X) 2025-05-07T20:26:40.7216087Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:26:40.7216393Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:26:40.7216687Z #define __TIMER_T_TYPE void * 2025-05-07T20:26:40.7216958Z #define __STRING(x) #x 2025-05-07T20:26:40.7217211Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:26:40.7217499Z #define _T_PTRDIFF_ 2025-05-07T20:26:40.7217763Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:26:40.7218184Z
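Aside: the _GLIBCXX_HAS_NESTED_TYPE expansion just above is libstdc++'s member-type detection idiom. A minimal standalone sketch of the same pattern, written against a hypothetical nested type value_type (names here are illustrative, not from this build):

#include <vector>
#include <type_traits>

// Same shape as libstdc++'s __void_t helper.
template <typename...> using void_t = void;

// Primary template: assume the nested type is absent.
template <typename T, typename = void_t<>>
struct has_value_type : std::false_type { };

// Partial specialization: chosen only when T::value_type is well-formed.
template <typename T>
struct has_value_type<T, void_t<typename T::value_type>> : std::true_type { };

static_assert(has_value_type<std::vector<int>>::value, "vector has value_type");
static_assert(!has_value_type<int>::value, "int does not");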
#define cudaEventWaitExternal 0x01 2025-05-07T20:26:40.7218480Z #define __unbounded 2025-05-07T20:26:40.7218740Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:40.7219127Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:26:40.7219421Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:40.7219741Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:26:40.7220035Z #define __cpp_lib_is_final 201402L 2025-05-07T20:26:40.7220343Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:26:40.7220689Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:26:40.7221016Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:26:40.7221311Z #define __managed__ __location__(managed) 2025-05-07T20:26:40.7221633Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:26:40.7222054Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:40.7222492Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:26:40.7222770Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:26:40.7223172Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:26:40.7223597Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:26:40.7224222Z #define _SYS_SIZE_T_H 2025-05-07T20:26:40.7224620Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:26:40.7224981Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:26:40.7225277Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:26:40.7225589Z #define _CRTIMP 2025-05-07T20:26:40.7225828Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:26:40.7226162Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:40.7226547Z #define STA_PPSJITTER 0x0200 2025-05-07T20:26:40.7226950Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:26:40.7227382Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:40.7227725Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:26:40.7228028Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:26:40.7228338Z #define __SIZE_T__ 2025-05-07T20:26:40.7228571Z #define __stub_gtty 2025-05-07T20:26:40.7228820Z #define __pid_t_defined 2025-05-07T20:26:40.7229101Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:26:40.7229426Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:40.7229765Z #define __glibcxx_function_requires(...) 
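Aside: __GNUC_PREREQ, defined just above, is glibc's compiler-version gate. A hedged usage sketch (the MY_COLD name is hypothetical, not from this build):

#include <features.h>   // provides __GNUC_PREREQ

// Enable a GCC attribute only when the compiler is new enough to have it
// (__cold__ appeared in GCC 4.3).
#if __GNUC_PREREQ (4, 3)
# define MY_COLD __attribute__ ((__cold__))
#else
# define MY_COLD
#endif

MY_COLD void report_fatal_error();   // hint: rarely-executed path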
2025-05-07T20:26:40.7230082Z #define __SM_80_RT_HPP__ 2025-05-07T20:26:40.7230343Z #define __need_clockid_t 2025-05-07T20:26:40.7230611Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:26:40.7230889Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:26:40.7231225Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:26:40.7231565Z #define _IO_HEX 0100 2025-05-07T20:26:40.7231847Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:26:40.7232200Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:26:40.7232312Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:26:40.7232420Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:26:40.7232659Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:40.7232793Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:26:40.7232906Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:26:40.7233025Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:26:40.7233138Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:26:40.7233248Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:26:40.7233344Z #define __stub_sstk 2025-05-07T20:26:40.7233447Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:26:40.7233675Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:26:40.7233774Z #define __wur 2025-05-07T20:26:40.7233900Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:26:40.7233999Z #define _G_HAVE_MMAP 1 2025-05-07T20:26:40.7234089Z #define _IO_OCT 040 2025-05-07T20:26:40.7234192Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:26:40.7234295Z #define NL_MSGMAX INT_MAX 2025-05-07T20:26:40.7234394Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:26:40.7234530Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:26:40.7234892Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:26:40.7235006Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:26:40.7235326Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:26:40.7235436Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:26:40.7235534Z #define _STL_ALGOBASE_H 1 2025-05-07T20:26:40.7235653Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:26:40.7235755Z #define __off64_t_defined 2025-05-07T20:26:40.7235863Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:26:40.7235962Z #define __FLT128_DIG__ 33 2025-05-07T20:26:40.7236075Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:26:40.7236180Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:26:40.7236276Z #define __INT32_C(c) c 2025-05-07T20:26:40.7236379Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:26:40.7236485Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:26:40.7236595Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:26:40.7236694Z #define __PDP_ENDIAN 3412 2025-05-07T20:26:40.7236791Z #define _ISOC95_SOURCE 1 2025-05-07T20:26:40.7236906Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:26:40.7237046Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:26:40.7237154Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:26:40.7237259Z #define __SM_90_RT_HPP__ 2025-05-07T20:26:40.7237365Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:26:40.7237474Z #define __have_pthread_attr_t 1 2025-05-07T20:26:40.7237580Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:26:40.7237813Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:26:40.7237936Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:26:40.7238046Z #define __cudaCDP2EventRecord 2025-05-07T20:26:40.7238148Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:26:40.7238247Z #define 
htole32(x) (x) 2025-05-07T20:26:40.7238510Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:26:40.7238641Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:26:40.7238754Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:26:40.7238927Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:26:40.7239087Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:26:40.7239221Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:26:40.7239368Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:26:40.7239473Z #define ADJ_OFFSET 0x0001 2025-05-07T20:26:40.7239580Z #define cudaArrayLayered 0x01 2025-05-07T20:26:40.7239759Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:26:40.7239883Z #define cudaEventRecordDefault 0x00 2025-05-07T20:26:40.7239984Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:26:40.7240090Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:26:40.7240186Z #define unix 1 2025-05-07T20:26:40.7240286Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:26:40.7240386Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:26:40.7240498Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:26:40.7240624Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:26:40.7240732Z #define __USE_POSIX 1 2025-05-07T20:26:40.7240835Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:26:40.7240981Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:26:40.7241085Z #define __THROWNL throw () 2025-05-07T20:26:40.7241185Z #define __cpp_rtti 199711L 2025-05-07T20:26:40.7241300Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:26:40.7241404Z #define __PMT(args) args 2025-05-07T20:26:40.7241526Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:40.7241684Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:26:40.7241812Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:26:40.7241915Z #define _SIZE_T_DECLARED 2025-05-07T20:26:40.7242024Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:26:40.7242123Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:26:40.7242536Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:26:40.7242791Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:26:40.7242892Z #define XATTR_LIST_MAX 65536 2025-05-07T20:26:40.7242996Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:26:40.7243235Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:26:40.7243328Z #define _WCHAR_T_H 2025-05-07T20:26:40.7243424Z #define __FLT64X_DIG__ 18 2025-05-07T20:26:40.7243529Z #define _IO_SHOWBASE 0200 2025-05-07T20:26:40.7243626Z #define _POSIX_QLIMIT 1 2025-05-07T20:26:40.7243732Z #define __INT8_TYPE__ signed char 2025-05-07T20:26:40.7243839Z #define __SURFACE_TYPES_H__ 2025-05-07T20:26:40.7243934Z #define __CUDA_ARCH__ 520 2025-05-07T20:26:40.7244056Z #define __cpp_digit_separators 201309L 2025-05-07T20:26:40.7244145Z #define __ELF__ 1 2025-05-07T20:26:40.7244252Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:26:40.7244365Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:26:40.7244458Z #define STA_INS 0x0010 2025-05-07T20:26:40.7244563Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:26:40.7244757Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:26:40.7244858Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:26:40.7244966Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:26:40.7245094Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
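Aside: htole32 at the start of this line expands to the identity because this is a little-endian target (the dump also shows __BYTE_ORDER __LITTLE_ENDIAN), while the htobe* variants byte-swap. A minimal sketch of typical use of these <endian.h> macros:

#include <endian.h>
#include <cstdint>
#include <cstdio>

int main() {
  std::uint32_t host = 0x11223344u;
  std::uint32_t wire = htole32(host);   // no-op here; swaps on big-endian hosts
  std::printf("host %08x wire %08x back %08x\n", host, wire, le32toh(wire));
}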
2025-05-07T20:26:40.7245210Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:26:40.7245320Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:26:40.7245432Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:26:40.7245535Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:26:40.7245708Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:26:40.7245876Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:26:40.7245982Z #define _IO_funlockfile(_fp) 2025-05-07T20:26:40.7246324Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:40.7246461Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:26:40.7246563Z #define __DRIVER_TYPES_H__ 2025-05-07T20:26:40.7246669Z #define __FLT_RADIX__ 2 2025-05-07T20:26:40.7246778Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:26:40.7246965Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:26:40.7247067Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:26:40.7247168Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:26:40.7247285Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:26:40.7247389Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:26:40.7247493Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:26:40.7247610Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:26:40.7247701Z #define WORD_BIT 32 2025-05-07T20:26:40.7247794Z #define _IO_USER_BUF 1 2025-05-07T20:26:40.7247902Z #define __VECTOR_TYPES_H__ 2025-05-07T20:26:40.7248016Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:40.7248134Z #define cudaHostAllocPortable 0x01 2025-05-07T20:26:40.7248245Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:26:40.7248352Z #define __long_double_t long double 2025-05-07T20:26:40.7248465Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:26:40.7248563Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:26:40.7248986Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:26:40.7249082Z #define __k8 1 2025-05-07T20:26:40.7249289Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:26:40.7249469Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:26:40.7249602Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:26:40.7249710Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:26:40.7249816Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:26:40.7249931Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:26:40.7250032Z #define __blksize_t_defined 2025-05-07T20:26:40.7250138Z #define _IO_SHOWPOINT 0400 2025-05-07T20:26:40.7250243Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:26:40.7250365Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:26:40.7250471Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:26:40.7250762Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:26:40.7250863Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:26:40.7251046Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:26:40.7251313Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:26:40.7251671Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:26:40.7251786Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:26:40.7251891Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:26:40.7251986Z #define SEEK_SET 0 2025-05-07T20:26:40.7252092Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:26:40.7252196Z #define 
__CUDA_API_VER_MINOR__ 8 2025-05-07T20:26:40.7252424Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:26:40.7252536Z #define __cudaCDP2GetLastError 2025-05-07T20:26:40.7252638Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:26:40.7252742Z #define _MATH_H_MATHDEF 1 2025-05-07T20:26:40.7253087Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:26:40.7253198Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:26:40.7253311Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:26:40.7253408Z #define __stub_sigreturn 2025-05-07T20:26:40.7253664Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:26:40.7253768Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:26:40.7253865Z #define __HOST_CONFIG_H__ 2025-05-07T20:26:40.7253980Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:26:40.7254072Z #define CLOCK_TAI 11 2025-05-07T20:26:40.7254187Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:26:40.7254416Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:26:40.7254512Z #define __restrict_arr 2025-05-07T20:26:40.7254632Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:26:40.7254796Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:26:40.7255345Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:26:40.7255546Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:26:40.7255639Z #define __USE_MISC 1 2025-05-07T20:26:40.7255751Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:26:40.7255865Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:26:40.7255961Z #define _GCC_LIMITS_H_ 2025-05-07T20:26:40.7256054Z #define __LDBL_DIG__ 18 2025-05-07T20:26:40.7256170Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:26:40.7256279Z #define __malloc_and_calloc_defined 2025-05-07T20:26:40.7256379Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:26:40.7256496Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:26:40.7256586Z #define __x86_64__ 1 2025-05-07T20:26:40.7257001Z #define _SIZE_T_ 2025-05-07T20:26:40.7258004Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:26:40.7258177Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:26:40.7266889Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:26:40.7267054Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:26:40.7267197Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:26:40.7267306Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:26:40.7267427Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:26:40.7267570Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:26:40.7267722Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:26:40.7267830Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:26:40.7268582Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy 
(__new, __old, __len); })) 2025-05-07T20:26:40.7268717Z #define __no_return__ __attribute__((noreturn)) 2025-05-07T20:26:40.7268882Z #define __device_builtin__ __location__(device_builtin) 2025-05-07T20:26:40.7268993Z #define _PSTL_HIDE_FROM_ABI_POP 2025-05-07T20:26:40.7269101Z #define _GLIBCXX_HAVE_ACOSF 1 2025-05-07T20:26:40.7269205Z #define STA_FLL 0x0008 2025-05-07T20:26:40.7269360Z #define _GLIBCXX_HAVE_BUILTIN_IS_CONSTANT_EVALUATED 1 2025-05-07T20:26:40.7269465Z #define _GLIBCXX_END_EXTERN_C } 2025-05-07T20:26:40.7269603Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:40.7269724Z #define __cpp_lib_integer_sequence 201304 2025-05-07T20:26:40.7269819Z #define __stub_revoke 2025-05-07T20:26:40.7269926Z #define __timer_t_defined 1 2025-05-07T20:26:40.7270068Z #define _GLIBCXX11_DEPRECATED _GLIBCXX_DEPRECATED 2025-05-07T20:26:40.7270182Z #define INT_MAX __INT_MAX__ 2025-05-07T20:26:40.7270298Z #define ULLONG_MAX (LLONG_MAX * 2ULL + 1) 2025-05-07T20:26:40.7270418Z #define _GLIBCXX_END_NAMESPACE_CXX11 } 2025-05-07T20:26:40.7270534Z #define _GLIBCXX_ICONV_CONST 2025-05-07T20:26:40.7270650Z #define major(dev) gnu_dev_major (dev) 2025-05-07T20:26:40.7270769Z #define cudaArrayTextureGather 0x08 2025-05-07T20:26:40.7270888Z #define _GLIBCXX_LT_OBJDIR ".libs/" 2025-05-07T20:26:40.7271047Z #define __inline_hint__ __attribute__((nv_inline_hint)) 2025-05-07T20:26:40.7271151Z #define __NV_LEGACY_LAUNCH 1 2025-05-07T20:26:40.7271260Z #define _IO_off_t __off_t 2025-05-07T20:26:40.7271355Z #define __FLT64_DIG__ 15 2025-05-07T20:26:40.7271590Z #define PTHREAD_DESTRUCTOR_ITERATIONS _POSIX_THREAD_DESTRUCTOR_ITERATIONS 2025-05-07T20:26:40.7271703Z #define _POSIX2_LINE_MAX 2048 2025-05-07T20:26:40.7271839Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:40.7271979Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:26:40.7272089Z #define ADJ_FREQUENCY 0x0002 2025-05-07T20:26:40.7272200Z #define __CUDART_API_PTDS(api) api 2025-05-07T20:26:40.7272301Z #define NULL __null 2025-05-07T20:26:40.7272451Z #define cudaStreamPerThread ((cudaStream_t)0x2) 2025-05-07T20:26:40.7272566Z #define _GLIBCXX_CONSTEXPR constexpr 2025-05-07T20:26:40.7272682Z #define __U64_TYPE unsigned long int 2025-05-07T20:26:40.7272785Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:26:40.7272886Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:26:40.7272984Z #define FP_ZERO 2 2025-05-07T20:26:40.7273089Z #define _GLIBCXX_HAVE_FLOORL 1 2025-05-07T20:26:40.7273259Z #define __isgraph_l(c,l) __isctype_l((c), _ISgraph, (l)) 2025-05-07T20:26:40.7273376Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:40.7273469Z #define __WCHAR_T__ 2025-05-07T20:26:40.7273751Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:26:40.7273960Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:26:40.7274122Z #define _GLIBCXX_NORETURN __attribute__ ((__noreturn__)) 2025-05-07T20:26:40.7274244Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:26:40.7274380Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:26:40.7274503Z #define _GLIBCXX20_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:40.7274646Z #define __WSTOPSIG(status) __WEXITSTATUS(status) 2025-05-07T20:26:40.7274781Z #define cudaSurfaceTypeCubemapLayered 0xFC 2025-05-07T20:26:40.7274886Z #define _BSD_PTRDIFF_T_ 2025-05-07T20:26:40.7274985Z #define _SIGSET_H_types 1 2025-05-07T20:26:40.7275107Z #define cudaTextureType1DLayered 0xF1 2025-05-07T20:26:40.7275228Z #define __cpp_unicode_literals 200710L 2025-05-07T20:26:40.7275385Z 
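Aside: the strdupa expansion completed just above copies a string via __builtin_alloca, so the duplicate lives in the caller's stack frame and is released on return. A hedged usage sketch (strdupa is a GNU extension; __USE_GNU is set in this build):

#include <string.h>   // strdupa
#include <cstdio>

static void shout(const char *word) {
  char *copy = strdupa(word);   // stack copy, freed automatically on return
  copy[0] = 'J';                // safe: we own the copy, not the original
  std::printf("%s\n", copy);
}

int main() { shout("hello"); }  // prints "Jello"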
#define __isdigit_l(c,l) __isctype_l((c), _ISdigit, (l)) 2025-05-07T20:26:40.7275498Z #define __LONG_LONG_PAIR(HI,LO) LO, HI 2025-05-07T20:26:40.7275630Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:26:40.7275773Z #define __bos0(ptr) __builtin_object_size (ptr, 0) 2025-05-07T20:26:40.7275895Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:26:40.7276127Z #define M_1_PIl 0.318309886183790671537767526745028724L 2025-05-07T20:26:40.7276248Z #define __CUDACC_DEVICE_ATOMIC_BUILTINS__ 1 2025-05-07T20:26:40.7276516Z #define WIFSTOPPED(status) __WIFSTOPPED (__WAIT_INT (status)) 2025-05-07T20:26:40.7276622Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:26:40.7276742Z #define _POSIX2_CHARCLASS_NAME_MAX 14 2025-05-07T20:26:40.7276847Z #define _GLIBCXX_BITS_STD_ABS_H 2025-05-07T20:26:40.7276942Z #define STA_MODE 0x4000 2025-05-07T20:26:40.7277065Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:26:40.7277176Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:26:40.7277299Z #define __glibcxx_signed_b(T,B) ((T)(-1) < 0) 2025-05-07T20:26:40.7277413Z #define __USING_NAMESPACE_C99(name) 2025-05-07T20:26:40.7277516Z #define BIG_ENDIAN __BIG_ENDIAN 2025-05-07T20:26:40.7277629Z #define __cudaCDP2EventRecord_ptsz 2025-05-07T20:26:40.7277736Z #define _GLIBCXX_HAVE_SINL 1 2025-05-07T20:26:40.7277856Z #define EXPR_NEST_MAX _POSIX2_EXPR_NEST_MAX 2025-05-07T20:26:40.7277960Z #define __SIZE_WIDTH__ 64 2025-05-07T20:26:40.7278091Z #define __BLKSIZE_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:40.7278186Z #define __SEG_FS 1 2025-05-07T20:26:40.7278288Z #define _IO_size_t size_t 2025-05-07T20:26:40.7278391Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:26:40.7278498Z #define INT_MIN (-INT_MAX - 1) 2025-05-07T20:26:40.7278594Z #define __stub_lchmod 2025-05-07T20:26:40.7278693Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:26:40.7278807Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:40.7278918Z #define _GLIBCXX_MANGLE_SIZE_T m 2025-05-07T20:26:40.7279006Z #define __SEG_GS 1 2025-05-07T20:26:40.7279200Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:26:40.7279300Z #define _IOS_APPEND 8 2025-05-07T20:26:40.7279402Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:26:40.7279501Z #define _GLIBCXX_RELEASE 11 2025-05-07T20:26:40.7279614Z #define _GLIBCXX98_USE_C99_WCHAR 1 2025-05-07T20:26:40.7279725Z #define _IO_IS_APPENDING 0x1000 2025-05-07T20:26:40.7279839Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:26:40.7279932Z #define htole16(x) (x) 2025-05-07T20:26:40.7280054Z #define __TEXTURE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:26:40.7280162Z #define _GLIBCXX_HAVE_FCNTL_H 1 2025-05-07T20:26:40.7280263Z #define __INT16_TYPE__ short int 2025-05-07T20:26:40.7280374Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:26:40.7280495Z #define __glibcxx_class_requires(_a,_b) 2025-05-07T20:26:40.7280612Z #define __cpp_structured_bindings 201606L 2025-05-07T20:26:40.7280744Z #define __align__(n) __attribute__((aligned(n))) 2025-05-07T20:26:40.7280850Z #define __SIZEOF_INT__ 4 2025-05-07T20:26:40.7280947Z #define __WCLONE 0x80000000 2025-05-07T20:26:40.7281047Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:26:40.7281142Z #define SEEK_HOLE 4 2025-05-07T20:26:40.7281237Z #define TIMER_ABSTIME 1 2025-05-07T20:26:40.7281349Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:26:40.7281447Z #define __CUDA_MATH_CRTIMP 2025-05-07T20:26:40.7281636Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:40.7281762Z #define 
__INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:40.7281871Z #define __DRIVER_FUNCTIONS_H__ 2025-05-07T20:26:40.7281987Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:26:40.7282098Z #define __MATH_FUNCTIONS_HPP__ 2025-05-07T20:26:40.7282228Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:26:40.7282325Z #define _LINUX_LIMITS_H 2025-05-07T20:26:40.7282419Z #define linux 1 2025-05-07T20:26:40.7282517Z #define MOD_MICRO ADJ_MICRO 2025-05-07T20:26:40.7282641Z #define _GLIBCXX_DEBUG_ASSERT(_Condition) 2025-05-07T20:26:40.7282748Z #define _GLIBCXX_HAVE_VSWSCANF 1 2025-05-07T20:26:40.7282849Z #define _GLIBCXX_HAVE_ISNAN 1 2025-05-07T20:26:40.7282969Z #define _XOPEN_IOV_MAX _POSIX_UIO_MAXIOV 2025-05-07T20:26:40.7283124Z #define __cudart_builtin__ __location__(cudart_builtin) 2025-05-07T20:26:40.7283229Z #define __cpp_lib_hypot 201603 2025-05-07T20:26:40.7283430Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:26:40.7283537Z #define _GLIBCXX_HAVE_WCTYPE_H 1 2025-05-07T20:26:40.7283633Z #define MOD_NANO ADJ_NANO 2025-05-07T20:26:40.7283807Z #define htole64(x) (x) 2025-05-07T20:26:40.7283917Z #define FP_ILOGBNAN (-2147483647 - 1) 2025-05-07T20:26:40.7284051Z #define _IO_stdout ((_IO_FILE*)(&_IO_2_1_stdout_)) 2025-05-07T20:26:40.7284159Z #define _IO_UPPERCASE 01000 2025-05-07T20:26:40.7284673Z #define cudaKernelNodeAttributeClusterSchedulingPolicyPreference cudaLaunchAttributeClusterSchedulingPolicyPreference 2025-05-07T20:26:40.7284777Z #define __USE_POSIX2 1 2025-05-07T20:26:40.7284884Z #define MOD_ESTERROR ADJ_ESTERROR 2025-05-07T20:26:40.7284980Z #define __WALL 0x40000000 2025-05-07T20:26:40.7285092Z #define _GLIBCXX_HAVE_LDEXPF 1 2025-05-07T20:26:40.7285183Z #define _XLOCALE_H 1 2025-05-07T20:26:40.7285285Z #define _GLIBCXX_USE_TMPNAM 1 2025-05-07T20:26:40.7285396Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:26:40.7285497Z #define __KEY_T_TYPE __S32_TYPE 2025-05-07T20:26:40.7285615Z #define __cudaGet_threadIdx() threadIdx 2025-05-07T20:26:40.7285719Z #define __EXCEPTIONS 1 2025-05-07T20:26:40.7285832Z #define __CUDART_API_PTSZ(api) api 2025-05-07T20:26:40.7286042Z #define __launch_bounds__(...) 
__annotate__(launch_bounds(__VA_ARGS__)) 2025-05-07T20:26:40.7286135Z #define __WORDSIZE 64 2025-05-07T20:26:40.7286235Z #define CLOCK_MONOTONIC 1 2025-05-07T20:26:40.7286338Z #define _STL_RELOPS_H 1 2025-05-07T20:26:40.7286439Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:26:40.7286545Z #define __BEGIN_DECLS extern "C" { 2025-05-07T20:26:40.7286657Z #define _GLIBCXX_HAVE_SYS_IPC_H 1 2025-05-07T20:26:40.7286757Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:26:40.7286863Z #define _GLIBCXX_HAVE_TRUNCATE 1 2025-05-07T20:26:40.7287186Z #define cudaKernelNodeAttributeClusterDimension cudaLaunchAttributeClusterDimension 2025-05-07T20:26:40.7287430Z #define _PSTL_GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:40.7287560Z #define _GLIBCXX_NAMESPACE_CXX11 __cxx11:: 2025-05-07T20:26:40.7287678Z #define _GLIBCXX_NUMERIC_LIMITS 1 2025-05-07T20:26:40.7287789Z #define __cpp_range_based_for 201603L 2025-05-07T20:26:40.7287919Z #define __cpp_lib_exchange_function 201304 2025-05-07T20:26:40.7288027Z #define _GLIBCXX_HAVE_INTTYPES_H 1 2025-05-07T20:26:40.7288142Z #define _GLIBCXX_DARWIN_USE_64_BIT_INODE 1 2025-05-07T20:26:40.7288341Z #define cudaCooperativeLaunchMultiDeviceNoPostSync 0x02 2025-05-07T20:26:40.7288447Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:26:40.7288547Z #define _GLIBCXX_CSTDLIB 1 2025-05-07T20:26:40.7288669Z #define _GLIBCXX_DEBUG_MACRO_SWITCH_H 1 2025-05-07T20:26:40.7288852Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:40.7288974Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:26:40.7289074Z #define _STRING_H 1 2025-05-07T20:26:40.7289180Z #define _BITS_PTHREADTYPES_H 1 2025-05-07T20:26:40.7289281Z #define _GCC_MAX_ALIGN_T 2025-05-07T20:26:40.7289386Z #define __SM_32_INTRINSICS_HPP__ 2025-05-07T20:26:40.7289536Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:26:40.7289643Z #define __code_model_small__ 1 2025-05-07T20:26:40.7289742Z #define _PSTL_CONFIG_H 2025-05-07T20:26:40.7289850Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:26:40.7289978Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:26:40.7290080Z #define __SM_20_INTRINSICS_H__ 2025-05-07T20:26:40.7290188Z #define cudaCpuDeviceId ((int)-1) 2025-05-07T20:26:40.7290547Z #define assert(expr) ((expr) ? 
__ASSERT_VOID_CAST (0) : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:40.7290647Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:26:40.7290745Z #define le64toh(x) (x) 2025-05-07T20:26:40.7290845Z #define FILENAME_MAX 4096 2025-05-07T20:26:40.7291003Z #define __iscntrl_l(c,l) __isctype_l((c), _IScntrl, (l)) 2025-05-07T20:26:40.7291130Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:26:40.7291220Z #define L_cuserid 9 2025-05-07T20:26:40.7291436Z #define __ino_t_defined 2025-05-07T20:26:40.7291528Z #define __k8__ 1 2025-05-07T20:26:40.7291632Z #define __INTPTR_TYPE__ long int 2025-05-07T20:26:40.7291823Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:26:40.7291926Z #define __int8_t_defined 2025-05-07T20:26:40.7292025Z #define __WCHAR_TYPE__ int 2025-05-07T20:26:40.7292132Z #define __CLOCKID_T_TYPE __S32_TYPE 2025-05-07T20:26:40.7292259Z #define cudaHostRegisterPortable 0x01 2025-05-07T20:26:40.7292363Z #define __SLONGWORD_TYPE long int 2025-05-07T20:26:40.7292498Z #define _GLIBCXX_PACKAGE_TARNAME "libstdc++" 2025-05-07T20:26:40.7292655Z #define __isblank_l(c,l) __isctype_l((c), _ISblank, (l)) 2025-05-07T20:26:40.7292748Z #define __HAVE_COLUMN 2025-05-07T20:26:40.7292851Z #define __stub_fdetach 2025-05-07T20:26:40.7293278Z #define __CUDACC_VER__ "__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead." 2025-05-07T20:26:40.7293369Z #define __pic__ 2 2025-05-07T20:26:40.7293512Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:40.7293616Z #define CLOCKS_PER_SEC 1000000l 2025-05-07T20:26:40.7293720Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:26:40.7293835Z #define _GLIBCXX_HAVE_SOCKATMARK 1 2025-05-07T20:26:40.7293928Z #define __stub_chflags 2025-05-07T20:26:40.7294030Z #define CLOCK_BOOTTIME 7 2025-05-07T20:26:40.7294121Z #define __need_IOV_MAX 2025-05-07T20:26:40.7294236Z #define putc(_ch,_fp) _IO_putc (_ch, _fp) 2025-05-07T20:26:40.7294355Z #define __UQUAD_TYPE unsigned long int 2025-05-07T20:26:40.7294460Z #define __cpp_decltype 200707L 2025-05-07T20:26:40.7294567Z #define __BYTE_ORDER __LITTLE_ENDIAN 2025-05-07T20:26:40.7294670Z #define _GLIBCXX_USE_C99 1 2025-05-07T20:26:40.7294785Z #define _GLIBCXX_TR1_BETA_FUNCTION_TCC 1 2025-05-07T20:26:40.7294878Z #define TTY_NAME_MAX 32 2025-05-07T20:26:40.7295064Z #define _GLIBCXX_FORWARD(_Tp,__val) std::forward<_Tp>(__val) 2025-05-07T20:26:40.7295194Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:40.7295380Z #define _PSTL_ASSERT(_Condition) __glibcxx_assert(_Condition) 2025-05-07T20:26:40.7295506Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:26:40.7295614Z #define __LITTLE_ENDIAN 1234 2025-05-07T20:26:40.7295720Z #define STA_PPSTIME 0x0004 2025-05-07T20:26:40.7295810Z #define __import__ 2025-05-07T20:26:40.7295909Z #define BUFSIZ _IO_BUFSIZ 2025-05-07T20:26:40.7296081Z #define M_SQRT2l 1.414213562373095048801688724209698079L 2025-05-07T20:26:40.7296171Z #define __export__ 2025-05-07T20:26:40.7296298Z #define __FSID_T_TYPE struct { int __val[2]; } 2025-05-07T20:26:40.7296414Z #define cudaMemAttachHost 0x02 2025-05-07T20:26:40.7296584Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:40.7296688Z #define _GLIBCXX_HAVE_ICONV 1 2025-05-07T20:26:40.7296793Z #define _GLIBCXX_SYMVER 1 2025-05-07T20:26:40.7296899Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:26:40.7297003Z #define _WCHAR_T_DECLARED 2025-05-07T20:26:40.7297130Z #define 
__UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:26:40.7297262Z #define isalpha_l(c,l) __isalpha_l ((c), (l)) 2025-05-07T20:26:40.7297382Z #define __cpp_inline_variables 201606L 2025-05-07T20:26:40.7297488Z #define WNOWAIT 0x01000000 2025-05-07T20:26:40.7297576Z #define PLOSS 6 2025-05-07T20:26:40.7297682Z #define M_LN10 2.30258509299404568402 2025-05-07T20:26:40.7297958Z #define _PSTL_UDS_PRESENT (__INTEL_COMPILER >= 1900 && __INTEL_COMPILER_BUILD_DATE >= 20180626) 2025-05-07T20:26:40.7298053Z #define EXIT_SUCCESS 0 2025-05-07T20:26:40.7298165Z #define __LDBL_REDIR_DECL(name) 2025-05-07T20:26:40.7298268Z #define _GLIBCXX_HAVE_STRTOF 1 2025-05-07T20:26:40.7298375Z #define MOD_FREQUENCY ADJ_FREQUENCY 2025-05-07T20:26:40.7298481Z #define __thread__ __thread 2025-05-07T20:26:40.7298585Z #define _GLIBCXX_HAVE_MEMORY_H 1 2025-05-07T20:26:40.7298690Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:26:40.7298801Z #define __SIZEOF_PTHREAD_BARRIER_T 32 2025-05-07T20:26:40.7299040Z #define __glibcxx_requires_partitioned_upper_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:26:40.7299267Z #define __cudaCDP2StreamWaitEvent_ptsz 2025-05-07T20:26:40.7299369Z #define _GLIBCXX_HAVE_SINF 1 2025-05-07T20:26:40.7299529Z #define __linux__ 1 2025-05-07T20:26:40.7299640Z #define STA_PPSSIGNAL 0x0100 2025-05-07T20:26:40.7299774Z #define M_LN2l 0.693147180559945309417232121458176568L 2025-05-07T20:26:40.7299874Z #define __S16_TYPE short int 2025-05-07T20:26:40.7300243Z #define __glibcxx_constexpr_assert(cond) if (__builtin_is_constant_evaluated() && !bool(cond)) __builtin_unreachable() 2025-05-07T20:26:40.7300358Z #define __NVCC_DIAG_PRAGMA_SUPPORT__ 1 2025-05-07T20:26:40.7300567Z #define __bos(ptr) __builtin_object_size (ptr, __USE_FORTIFY_LEVEL > 1) 2025-05-07T20:26:40.7300671Z #define __COMMON_FUNCTIONS_H__ 2025-05-07T20:26:40.7300776Z #define UINT_MAX (INT_MAX * 2U + 1U) 2025-05-07T20:26:40.7300870Z #define _T_SIZE_ 2025-05-07T20:26:40.7300975Z #define LLONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:40.7301103Z #define __cudaCDP2StreamCreateWithFlags 2025-05-07T20:26:40.7301218Z #define _PSTL_VERSION 12000 2025-05-07T20:26:40.7301346Z #define __noinline__ __attribute__((noinline)) 2025-05-07T20:26:40.7301453Z #define __WNOTHREAD 0x20000000 2025-05-07T20:26:40.7301562Z #define _G_va_list __gnuc_va_list 2025-05-07T20:26:40.7301699Z #define M_PI_4l 0.785398163397448309615660845819875721L 2025-05-07T20:26:40.7301796Z #define _IOS_INPUT 1 2025-05-07T20:26:40.7301895Z #define __USE_LARGEFILE64 1 2025-05-07T20:26:40.7302007Z #define _GLIBCXX_TR1_EXP_INTEGRAL_TCC 1 2025-05-07T20:26:40.7302112Z #define __INT64_TYPE__ long int 2025-05-07T20:26:40.7302215Z #define _POSIX_SSIZE_MAX 32767 2025-05-07T20:26:40.7302325Z #define __shared__ __location__(shared) 2025-05-07T20:26:40.7302428Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:26:40.7302593Z #define __glibc_unlikely(cond) __builtin_expect((cond), 0) 2025-05-07T20:26:40.7302687Z #define __gid_t_defined 2025-05-07T20:26:40.7302813Z #define _GLIBCXX_USE_SC_NPROCESSORS_ONLN 1 2025-05-07T20:26:40.7302922Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:26:40.7303130Z #define __glibcxx_requires_can_increment_range(_First1,_Last1,_First2) 2025-05-07T20:26:40.7303248Z #define _GLIBCXX17_INLINE inline 2025-05-07T20:26:40.7303346Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:26:40.7303445Z #define ___int_size_t_h 2025-05-07T20:26:40.7303559Z #define __FSBLKCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:40.7303690Z #define __cpp_inheriting_constructors 201511L 2025-05-07T20:26:40.7303863Z 
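Aside: the assert(expr) expansion a few entries back calls __assert_fail with __STRING(expr), __FILE__, __LINE__, and __ASSERT_FUNCTION only when the expression is false. A minimal sketch of the observable behavior:

#include <cassert>

int main() {
  int sum = 2 + 2;
  assert(sum == 4);    // true: reduces to the no-op cast, nothing is called
  // assert(sum == 5); // false: would call __assert_fail and abort;
                       // building with -DNDEBUG removes the check entirely
}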
#define __WIFCONTINUED(status) ((status) == __W_CONTINUED) 2025-05-07T20:26:40.7303974Z #define CUDA_DOUBLE_MATH_FUNCTIONS 1 2025-05-07T20:26:40.7304076Z #define _GLIBCXX_HAVE_FENV_H 1 2025-05-07T20:26:40.7304191Z #define _GLIBCXX_HAVE_STDBOOL_H 1 2025-05-07T20:26:40.7304291Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:26:40.7304422Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:40.7304548Z #define _GLIBCXX_TR1_HYPERGEOMETRIC_TCC 1 2025-05-07T20:26:40.7304676Z #define _GLIBCXX_DEBUG_PEDASSERT(_Condition) 2025-05-07T20:26:40.7304783Z #define __clock_t_defined 1 2025-05-07T20:26:40.7304891Z #define _POSIX_SEM_VALUE_MAX 32767 2025-05-07T20:26:40.7305007Z #define __cudaCDP2RuntimeGetVersion 2025-05-07T20:26:40.7305117Z #define __GLIBC_MINOR__ 17 2025-05-07T20:26:40.7305216Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:26:40.7305320Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:26:40.7305441Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:26:40.7305538Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:26:40.7305717Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:40.7305809Z #define __SSE__ 1 2025-05-07T20:26:40.7305913Z #define SEM_VALUE_MAX (2147483647) 2025-05-07T20:26:40.7306014Z #define M_SQRT1_2 0.70710678118654752440 2025-05-07T20:26:40.7306111Z #define _CTYPE_H 1 2025-05-07T20:26:40.7306210Z #define __sigset_t_defined 2025-05-07T20:26:40.7306319Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:26:40.7306421Z #define _GLIBCXX_HAVE_LOGF 1 2025-05-07T20:26:40.7306515Z #define MOD_TAI ADJ_TAI 2025-05-07T20:26:40.7306719Z #define _IO_va_list __gnuc_va_list 2025-05-07T20:26:40.7306819Z #define _GLIBCXX_HAVE_LOGL 1 2025-05-07T20:26:40.7306983Z #define __SM_70_RT_H__ 2025-05-07T20:26:40.7307091Z #define _GLIBCXX_HAVE_WRITEV 1 2025-05-07T20:26:40.7307202Z #define cudaEventWaitDefault 0x00 2025-05-07T20:26:40.7307305Z #define _GLIBCXX_HAVE_EXPL 1 2025-05-07T20:26:40.7307479Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:40.7307580Z #define _POSIX_MAX_CANON 255 2025-05-07T20:26:40.7307698Z #define _GLIBCXX_NOEXCEPT_PARM , bool _NE 2025-05-07T20:26:40.7307805Z #define FD_SETSIZE __FD_SETSIZE 2025-05-07T20:26:40.7307903Z #define _GLIBCXX_TXN_SAFE 2025-05-07T20:26:40.7307998Z #define __amd64__ 1 2025-05-07T20:26:40.7308094Z #define __WINT_WIDTH__ 32 2025-05-07T20:26:40.7308205Z #define __CUDA_DEVICE_RUNTIME_API_H__ 2025-05-07T20:26:40.7308492Z #define __REDIRECT_NTHNL(name,proto,alias) name proto __THROWNL __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:40.7308604Z #define _GLIBCXX_STDIO_SEEK_CUR 1 2025-05-07T20:26:40.7308693Z #define EOF (-1) 2025-05-07T20:26:40.7308803Z #define __WAIT_STATUS_DEFN void * 2025-05-07T20:26:40.7308908Z #define __USE_POSIX199309 1 2025-05-07T20:26:40.7309010Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:26:40.7309122Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:26:40.7309222Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:26:40.7309332Z #define LLONG_MIN (-LLONG_MAX-1) 2025-05-07T20:26:40.7309453Z #define cudaSurfaceType2DLayered 0xF2 2025-05-07T20:26:40.7309553Z #define ____mbstate_t_defined 1 2025-05-07T20:26:40.7309655Z #define STA_NANO 0x2000 2025-05-07T20:26:40.7309757Z #define _GLIBCXX_HAVE_LOG10F 1 2025-05-07T20:26:40.7309857Z #define _GLIBCXX_HAVE_LOG10L 1 2025-05-07T20:26:40.7309953Z #define _IO_LINKED 0x80 2025-05-07T20:26:40.7310081Z #define __cpp_lib_launder 201606 2025-05-07T20:26:40.7310179Z #define __SIZEOF_INT128__ 16 2025-05-07T20:26:40.7310288Z 
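Aside: __WIFCONTINUED at the start of this line and the neighboring __W* helpers are the internals behind the public wait-status macros in <sys/wait.h>. A hedged host-side decoding sketch:

#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
  if (fork() == 0)
    _exit(7);                        // child terminates with code 7
  int status = 0;
  wait(&status);
  if (WIFEXITED(status))
    std::printf("exit code %d\n", WEXITSTATUS(status));   // prints 7
  else if (WIFSIGNALED(status))
    std::printf("killed by signal %d\n", WTERMSIG(status));
}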
#define __PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:26:40.7310400Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:26:40.7310502Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:26:40.7310653Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:26:40.7310780Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:40.7310889Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:40.7310997Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:26:40.7311098Z #define __W_CONTINUED 0xffff 2025-05-07T20:26:40.7311195Z #define __ATOMIC_RELAXED 0 2025-05-07T20:26:40.7311343Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:26:40.7311473Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:40.7311685Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:26:40.7311887Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:26:40.7311981Z #define __stub_stty 2025-05-07T20:26:40.7312157Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:26:40.7312258Z #define le16toh(x) (x) 2025-05-07T20:26:40.7312378Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:26:40.7312562Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:26:40.7312662Z #define _SIZET_ 2025-05-07T20:26:40.7312761Z #define XATTR_NAME_MAX 255 2025-05-07T20:26:40.7312861Z #define _SVID_SOURCE 1 2025-05-07T20:26:40.7312950Z #define _LP64 1 2025-05-07T20:26:40.7313048Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:26:40.7313300Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:26:40.7313420Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:26:40.7313515Z #define __UINT8_C(c) c 2025-05-07T20:26:40.7313762Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:26:40.7313863Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:26:40.7313979Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:26:40.7314086Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:26:40.7314186Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:26:40.7314296Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:26:40.7314480Z #define CUDARTAPI 2025-05-07T20:26:40.7314570Z #define IOV_MAX 1024 2025-05-07T20:26:40.7314803Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:26:40.7314908Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:26:40.7315010Z #define P_tmpdir "/tmp" 2025-05-07T20:26:40.7315125Z #define cudaMemAttachSingle 0x04 2025-05-07T20:26:40.7315213Z #define __wchar_t__ 2025-05-07T20:26:40.7315322Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:26:40.7315416Z #define SEEK_END 2 2025-05-07T20:26:40.7315515Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:26:40.7315695Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include(<tbb/tbb.h>) 2025-05-07T20:26:40.7315806Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:26:40.7315958Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:26:40.7316063Z #define ____FILE_defined 1 2025-05-07T20:26:40.7316189Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:26:40.7316291Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:26:40.7316397Z #define _ISOC99_SOURCE 1 2025-05-07T20:26:40.7316499Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:26:40.7316767Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:40.7316915Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:26:40.7317004Z #define _IO_RIGHT 04 2025-05-07T20:26:40.7317106Z #define __END_NAMESPACE_STD 2025-05-07T20:26:40.7317305Z
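Aside: __CUDA_ARCH_LIST__ on this line is 520, matching the __CUDA_ARCH__ 520 seen earlier, so this nvcc pass compiles device code for sm_52 only. A minimal sketch of the per-architecture gating these macros support (the 800 threshold is illustrative, not from this build):

// __CUDA_ARCH__ is defined only while nvcc compiles device code, so this
// specializes the device path without affecting the host compilation pass.
__device__ int lanes_per_step() {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
  return 4;   // hypothetical wider path for sm_80 and newer
#else
  return 1;   // conservative path for sm_52 passes like this one
#endif
}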
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:40.7317406Z #define _GLIBCXX_STD_C std 2025-05-07T20:26:40.7317539Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:26:40.7317642Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:26:40.7317750Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:26:40.7317845Z #define _STDDEF_H_ 2025-05-07T20:26:40.7318026Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:40.7318130Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:26:40.7318263Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:26:40.7318476Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:26:40.7318599Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:40.7318758Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:26:40.7318890Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:26:40.7319006Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:26:40.7319122Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:26:40.7319222Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:26:40.7319349Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:26:40.7319452Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:26:40.7319552Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:26:40.7319664Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:26:40.7319845Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:26:40.7319945Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:26:40.7320140Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:26:40.7320252Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:26:40.7320353Z #define __STDCPP_THREADS__ 1 2025-05-07T20:26:40.7320518Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:26:40.7320620Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:26:40.7320725Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:26:40.7320831Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:26:40.7320958Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:26:40.7321065Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:26:40.7321174Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:26:40.7321348Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:26:40.7321533Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:26:40.7321639Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:26:40.7321766Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:26:40.7321893Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:26:40.7322129Z #define __location__(a) __annotate__(a) 2025-05-07T20:26:40.7322451Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:26:40.7322556Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:26:40.7322679Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:26:40.7322786Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:26:40.7322882Z #define __STDC_UTF_32__ 1 2025-05-07T20:26:40.7322982Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:26:40.7323093Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:26:40.7323196Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:26:40.7323285Z #define __FXSR__ 1 2025-05-07T20:26:40.7323379Z #define _SIZE_T 2025-05-07T20:26:40.7323491Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:26:40.7323617Z #define cudaHostRegisterReadOnly 0x08 2025-05-07T20:26:40.7324070Z #define __FLT32X_MAX__ 
2025-05-07T20:26:40.7324306Z [compiler macro dump elided: several thousand predefined #define lines emitted by the g++/nvcc toolchain and the glibc/libstdc++/CUDA headers, including __NVCC__ 1, __CUDACC__ 1, and CUDART_VERSION 12080, confirming the CUDA 12.8 toolchain]
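A dump like the one above can be reproduced outside CI. A minimal sketch, assuming a GNU host compiler and nvcc on the PATH (the exact command the build script used to emit this dump is not shown in this log):

# Predefined macros of the host compiler alone:
g++ -dM -E -x c++ /dev/null | sort | head

# CUDA-specific macros (e.g. __CUDACC__, CUDART_VERSION) are only defined
# when nvcc processes .cu input; forwarding -dM to the host compiler
# approximates the dump seen above:
touch empty.cu
nvcc -E -Xcompiler -dM empty.cu | sort | head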
2025-05-07T20:26:40.7359264Z 2025-05-07T20:26:40.7529528Z 2025-05-07T20:26:40.7530002Z + conda run -n build_binary nvcc --version 2025-05-07T20:26:40.7530015Z 2025-05-07T20:26:42.6577450Z nvcc: NVIDIA (R) Cuda compiler driver 2025-05-07T20:26:42.6577862Z Copyright (c) 2005-2025 NVIDIA Corporation 2025-05-07T20:26:42.6578205Z Built on Wed_Jan_15_19:20:09_PST_2025 2025-05-07T20:26:42.6578540Z Cuda compilation tools, release 12.8, V12.8.61 2025-05-07T20:26:42.6578900Z Build cuda_12.8.r12.8/compiler.35404655_0 2025-05-07T20:26:42.6579120Z 2025-05-07T20:26:42.7252257Z 2025-05-07T20:26:42.7262898Z /usr/bin/nvidia-smi 2025-05-07T20:26:42.7268539Z + nvidia-smi 2025-05-07T20:26:42.7268693Z 2025-05-07T20:26:42.7444278Z Wed May 7 20:26:42 2025 2025-05-07T20:26:42.7444690Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:26:42.7445261Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:26:42.7445810Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:26:42.7446344Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:26:42.7446966Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:26:42.7447434Z | | | MIG M. | 2025-05-07T20:26:42.7447803Z |=========================================+========================+======================| 2025-05-07T20:26:42.7613062Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:26:42.7613545Z | 0% 27C P8 16W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:26:42.7614294Z | | | N/A | 2025-05-07T20:26:42.7614860Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:26:42.7617746Z 2025-05-07T20:26:42.7618181Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:26:42.7618658Z | Processes: | 2025-05-07T20:26:42.7619144Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:26:42.7619593Z | ID ID Usage | 2025-05-07T20:26:42.7619980Z |=========================================================================================| 2025-05-07T20:26:42.7622683Z | No running processes found | 2025-05-07T20:26:42.7623215Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:26:43.0119634Z 2025-05-07T20:26:43.0126007Z [INSTALL] Successfully installed CUDA 12.8.0 2025-05-07T20:26:43.0182068Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0 2025-05-07T20:26:43.0182691Z . 
$PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0 2025-05-07T20:26:43.0196177Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:26:43.0196580Z env: 2025-05-07T20:26:43.0196846Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:26:43.0197194Z BUILD_ENV: build_binary 2025-05-07T20:26:43.0197483Z BUILD_TARGET: genai 2025-05-07T20:26:43.0197756Z BUILD_VARIANT: cuda 2025-05-07T20:26:43.0198026Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:26:43.0198327Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:26:43.0198679Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:26:43.0199077Z ##[endgroup] 2025-05-07T20:26:43.3924326Z ################################################################################ 2025-05-07T20:26:43.3924888Z # Install PyTorch (PIP) 2025-05-07T20:26:43.3925263Z # 2025-05-07T20:26:43.3942436Z # [2025-05-07T20:26:43.393Z] + install_pytorch_pip build_binary nightly cuda/12.8.0 2025-05-07T20:26:43.3943138Z ################################################################################ 2025-05-07T20:26:43.3943468Z 2025-05-07T20:26:43.3972949Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy 2025-05-07T20:26:44.5079952Z Channels: 2025-05-07T20:26:44.5080231Z - conda-forge 2025-05-07T20:26:44.5080497Z Platform: linux-64 2025-05-07T20:26:47.8817110Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:26:48.6051672Z Solving environment: \ | / done 2025-05-07T20:26:48.8281083Z 2025-05-07T20:26:48.8281355Z ## Package Plan ## 2025-05-07T20:26:48.8281562Z 2025-05-07T20:26:48.8281781Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:26:48.8282101Z 2025-05-07T20:26:48.8282211Z added / updated specs: 2025-05-07T20:26:48.8282465Z - numpy 2025-05-07T20:26:48.8282597Z 2025-05-07T20:26:48.8282614Z 2025-05-07T20:26:48.8282741Z The following packages will be downloaded: 2025-05-07T20:26:48.8282967Z 2025-05-07T20:26:48.8283100Z package | build 2025-05-07T20:26:48.8283432Z ---------------------------|----------------- 2025-05-07T20:26:48.8283836Z libblas-3.9.0 |31_h59b9bed_openblas 16 KB conda-forge 2025-05-07T20:26:48.8284316Z libcblas-3.9.0 |31_he106b2a_openblas 16 KB conda-forge 2025-05-07T20:26:48.8284793Z libgfortran-15.1.0 | h69a702a_2 34 KB conda-forge 2025-05-07T20:26:48.8285262Z libgfortran5-15.1.0 | hcea5267_2 1.5 MB conda-forge 2025-05-07T20:26:48.8285743Z liblapack-3.9.0 |31_h7ac8fdf_openblas 16 KB conda-forge 2025-05-07T20:26:48.8286609Z libopenblas-0.3.29 |pthreads_h94d23a6_0 5.6 MB conda-forge 2025-05-07T20:26:48.8287083Z numpy-2.2.5 | py310hefbff90_0 7.6 MB conda-forge 2025-05-07T20:26:48.8287493Z ------------------------------------------------------------ 2025-05-07T20:26:48.8287852Z Total: 14.8 MB 2025-05-07T20:26:48.8288071Z 2025-05-07T20:26:48.8288216Z The following NEW packages will be INSTALLED: 2025-05-07T20:26:48.8288446Z 2025-05-07T20:26:48.8288681Z libblas conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas 2025-05-07T20:26:48.8289201Z libcblas conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas 2025-05-07T20:26:48.8289729Z libgfortran conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2 2025-05-07T20:26:48.8290255Z libgfortran5 conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2 2025-05-07T20:26:48.8290801Z liblapack conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas 2025-05-07T20:26:48.8291362Z libopenblas conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0 2025-05-07T20:26:48.8292102Z numpy 
conda-forge/linux-64::numpy-2.2.5-py310hefbff90_0
2025-05-07T20:26:48.8292574Z Downloading and Extracting Packages: ...working...
2025-05-07T20:26:49.1217625Z libblas-3.9.0 | 16 KB | ########## | 100%
2025-05-07T20:26:49.1271276Z libgfortran-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:26:49.1406766Z libcblas-3.9.0 | 16 KB | ########## | 100%
2025-05-07T20:26:49.1917515Z liblapack-3.9.0 | 16 KB | ########## | 100%
2025-05-07T20:26:49.3922440Z libgfortran5-15.1.0 | 1.5 MB | ########## | 100%
2025-05-07T20:26:49.7561984Z libopenblas-0.3.29 | 5.6 MB | ########## | 100%
2025-05-07T20:26:49.7569095Z numpy-2.2.5 | 7.6 MB | ########## | 100%
2025-05-07T20:26:49.7573207Z done
2025-05-07T20:26:49.8579639Z Preparing transaction: done
2025-05-07T20:26:49.9584337Z Verifying transaction: done
2025-05-07T20:26:50.0593607Z Executing transaction: done
2025-05-07T20:26:50.2430316Z ################################################################################
2025-05-07T20:26:50.2430801Z # Install Package From PyTorch PIP: torch
2025-05-07T20:26:50.2431125Z #
2025-05-07T20:26:50.2446244Z # [2025-05-07T20:26:50.244Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.8.0
2025-05-07T20:26:50.2446859Z ################################################################################
2025-05-07T20:26:50.2461967Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:26:50.3403160Z [CHECK] Network does not appear to be blocked.
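The [EXEC] [ATTEMPT 0/3] prefix indicates that each external command is wrapped in a retry helper defined in setup_env.bash. A minimal sketch of such a wrapper; the name exec_with_retries and the backoff policy are stand-ins, not the script's actual implementation:

# Hypothetical retry wrapper matching the "[EXEC] [ATTEMPT n/3]" log lines.
exec_with_retries () {
  local max_retries=3
  for ((attempt = 0; attempt <= max_retries; attempt++)); do
    echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
    if "$@"; then
      return 0
    fi
    sleep $((2 ** attempt))  # back off before the next attempt (assumed policy)
  done
  return 1
}

# Usage, as with the network probe above:
exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null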
2025-05-07T20:26:50.3403690Z ################################################################################ 2025-05-07T20:26:50.3404183Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:26:50.3404608Z # 2025-05-07T20:26:50.3421716Z # [2025-05-07T20:26:50.341Z] + __prepare_pip_arguments torch nightly cuda/12.8.0 2025-05-07T20:26:50.3422544Z ################################################################################ 2025-05-07T20:26:50.3422806Z 2025-05-07T20:26:50.3445584Z [INSTALL] Extracted package (channel, version): (nightly, LATEST) 2025-05-07T20:26:50.3473101Z [INSTALL] Extracted package variant: cu128 2025-05-07T20:26:50.3490721Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:26:50.3491301Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:26:50.3500303Z [INSTALL] Extracted the full PIP package: --pre torch 2025-05-07T20:26:50.3510133Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu128/ ... 2025-05-07T20:26:50.3533445Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:28:29.3009850Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:28:29.3010486Z Collecting torch 2025-05-07T20:28:29.3011260Z Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (30 kB) 2025-05-07T20:28:29.3012005Z Collecting filelock (from torch) 2025-05-07T20:28:29.3012530Z Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB) 2025-05-07T20:28:29.3013506Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from torch) (4.13.2) 2025-05-07T20:28:29.3014350Z Collecting sympy>=1.13.3 (from torch) 2025-05-07T20:28:29.3014870Z Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB) 2025-05-07T20:28:29.3015743Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 185.3 MB/s eta 0:00:00 2025-05-07T20:28:29.3016128Z Collecting networkx (from torch) 2025-05-07T20:28:29.3016653Z Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB) 2025-05-07T20:28:29.3017329Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 127.8 MB/s eta 0:00:00 2025-05-07T20:28:29.3017701Z Collecting jinja2 (from torch) 2025-05-07T20:28:29.3018204Z Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB) 2025-05-07T20:28:29.3018726Z Collecting fsspec (from torch) 2025-05-07T20:28:29.3019249Z Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB) 2025-05-07T20:28:29.3019853Z Collecting nvidia-cuda-nvrtc-cu12==12.8.61 (from torch) 2025-05-07T20:28:29.3020723Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:29.3021587Z Collecting nvidia-cuda-runtime-cu12==12.8.57 (from torch) 2025-05-07T20:28:29.3022961Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:29.3024136Z Collecting nvidia-cuda-cupti-cu12==12.8.57 (from torch) 2025-05-07T20:28:29.3025002Z Downloading 
https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:29.3025828Z Collecting nvidia-cudnn-cu12==9.8.0.87 (from torch) 2025-05-07T20:28:29.3026565Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl.metadata (1.8 kB) 2025-05-07T20:28:29.3027331Z Collecting nvidia-cublas-cu12==12.8.3.14 (from torch) 2025-05-07T20:28:29.3028120Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:29.3028967Z Collecting nvidia-cufft-cu12==11.3.3.41 (from torch) 2025-05-07T20:28:29.3029790Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:29.3030836Z Collecting nvidia-curand-cu12==10.3.9.55 (from torch) 2025-05-07T20:28:29.3031587Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:29.3032337Z Collecting nvidia-cusolver-cu12==11.7.2.55 (from torch) 2025-05-07T20:28:29.3033101Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:29.3033960Z Collecting nvidia-cusparse-cu12==12.5.7.53 (from torch) 2025-05-07T20:28:29.3034796Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:29.3035642Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch) 2025-05-07T20:28:29.3036404Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl.metadata (6.8 kB) 2025-05-07T20:28:29.3037139Z Collecting nvidia-nccl-cu12==2.26.2 (from torch) 2025-05-07T20:28:29.3037927Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB) 2025-05-07T20:28:29.3038722Z Collecting nvidia-nvtx-cu12==12.8.55 (from torch) 2025-05-07T20:28:29.3039519Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:29.3040335Z Collecting nvidia-nvjitlink-cu12==12.8.61 (from torch) 2025-05-07T20:28:29.3041164Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:29.3042000Z Collecting nvidia-cufile-cu12==1.13.0.11 (from torch) 2025-05-07T20:28:29.3042827Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:29.3043667Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch) 2025-05-07T20:28:29.3044525Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:29.3045841Z Requirement already satisfied: setuptools>=40.8.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from pytorch-triton==3.3.0+git96316ce5->torch) (78.1.1) 
2025-05-07T20:28:29.3046732Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch) 2025-05-07T20:28:29.3047308Z Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-05-07T20:28:29.3048219Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 5.2 MB/s eta 0:00:00 2025-05-07T20:28:29.3048614Z Collecting MarkupSafe>=2.0 (from jinja2->torch) 2025-05-07T20:28:29.3049359Z Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB) 2025-05-07T20:28:29.3050455Z Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp310-cp310-manylinux_2_28_x86_64.whl (1047.1 MB) 2025-05-07T20:28:29.3051281Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 GB 20.7 MB/s eta 0:00:00 2025-05-07T20:28:29.3052006Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl (609.6 MB) 2025-05-07T20:28:29.3052811Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 609.6/609.6 MB 50.8 MB/s eta 0:00:00 2025-05-07T20:28:29.3053615Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (10.2 MB) 2025-05-07T20:28:29.3054514Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.2/10.2 MB 160.2 MB/s eta 0:00:00 2025-05-07T20:28:29.3055438Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (88.0 MB) 2025-05-07T20:28:29.3056361Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88.0/88.0 MB 135.1 MB/s eta 0:00:00 2025-05-07T20:28:29.3057167Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (954 kB) 2025-05-07T20:28:29.3058109Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 954.8/954.8 kB 88.2 MB/s eta 0:00:00 2025-05-07T20:28:29.3058814Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl (698.0 MB) 2025-05-07T20:28:29.3059610Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 698.0/698.0 MB 41.8 MB/s eta 0:00:00 2025-05-07T20:28:29.3060414Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (193.1 MB) 2025-05-07T20:28:29.3061297Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 193.1/193.1 MB 115.3 MB/s eta 0:00:00 2025-05-07T20:28:29.3062088Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.2 MB) 2025-05-07T20:28:29.3062953Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 62.2 MB/s eta 0:00:00 2025-05-07T20:28:29.3063664Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl (63.6 MB) 2025-05-07T20:28:29.3064454Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.6/63.6 MB 149.9 MB/s eta 0:00:00 2025-05-07T20:28:29.3065174Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl (260.4 MB) 2025-05-07T20:28:29.3066090Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 260.4/260.4 MB 118.2 MB/s eta 0:00:00 2025-05-07T20:28:29.3066903Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (292.1 MB) 
2025-05-07T20:28:29.3067786Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 292.1/292.1 MB 106.9 MB/s eta 0:00:00 2025-05-07T20:28:29.3068508Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-05-07T20:28:29.3069313Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 136.2 MB/s eta 0:00:00 2025-05-07T20:28:29.3070094Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-05-07T20:28:29.3070948Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 131.4 MB/s eta 0:00:00 2025-05-07T20:28:29.3071762Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.2 MB) 2025-05-07T20:28:29.3072758Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.2/39.2 MB 162.3 MB/s eta 0:00:00 2025-05-07T20:28:29.3073638Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (89 kB) 2025-05-07T20:28:29.3074817Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.4 MB) 2025-05-07T20:28:29.3075721Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.4/153.4 MB 128.7 MB/s eta 0:00:00 2025-05-07T20:28:29.3077494Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch 2025-05-07T20:28:29.3079315Z 2025-05-07T20:28:29.3081346Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.8.3.14 nvidia-cuda-cupti-cu12-12.8.57 nvidia-cuda-nvrtc-cu12-12.8.61 nvidia-cuda-runtime-cu12-12.8.57 nvidia-cudnn-cu12-9.8.0.87 nvidia-cufft-cu12-11.3.3.41 nvidia-cufile-cu12-1.13.0.11 nvidia-curand-cu12-10.3.9.55 nvidia-cusolver-cu12-11.7.2.55 nvidia-cusparse-cu12-12.5.7.53 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.8.61 nvidia-nvtx-cu12-12.8.55 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu128 2025-05-07T20:28:29.3083445Z 2025-05-07T20:28:31.5405337Z torch 2.8.0.dev20250507+cu128 2025-05-07T20:28:31.5407915Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu128) 2025-05-07T20:28:35.2935714Z [CHECK] Python (sub-)package 'torch.distributed' found ... 2025-05-07T20:28:38.8390216Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu128 2025-05-07T20:28:38.8390758Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ... 2025-05-07T20:28:42.3805474Z True 2025-05-07T20:28:42.3805738Z True 2025-05-07T20:28:42.3805853Z 2025-05-07T20:28:42.4426064Z [INSTALL] Successfully installed PyTorch through PyTorch PIP 2025-05-07T20:28:42.4464459Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:42.4465099Z if . 
$PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:42.4478262Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:42.4478631Z env: 2025-05-07T20:28:42.4478876Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:42.4479192Z BUILD_ENV: build_binary 2025-05-07T20:28:42.4479460Z BUILD_TARGET: genai 2025-05-07T20:28:42.4479884Z BUILD_VARIANT: cuda 2025-05-07T20:28:42.4480134Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:28:42.4480411Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:42.4480735Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:42.4481082Z ##[endgroup] 2025-05-07T20:28:42.7877068Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:28:42.7879142Z ################################################################################ 2025-05-07T20:28:42.7879826Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:28:42.7880357Z # 2025-05-07T20:28:42.7894671Z # [2025-05-07T20:28:42.789Z] + collect_pytorch_env_info build_binary 2025-05-07T20:28:42.7895244Z ################################################################################ 2025-05-07T20:28:42.7895554Z 2025-05-07T20:28:42.7910218Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:42.8839021Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:42.8848849Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T20:28:42.8849523Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:28:42.8849933Z 2025-05-07T20:28:42.9774458Z 2025-05-07T20:28:42.9774919Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:28:42.9798922Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:28:49.2327717Z Collecting environment information... 
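Before the full report below, a minimal sketch of spot-checking the installation by hand, equivalent in spirit to the [CHECK] lines above (environment name as in this log; this is not the script's actual implementation):

# Confirm wheel version, CUDA variant, C++11 ABI, and sub-package presence.
conda run -n build_binary python -c "
import torch
import torch.distributed                 # sub-package presence check
print(torch.__version__)                 # expect a ...+cu128 build
print(torch.version.cuda)                # expect 12.8
print(torch.compiled_with_cxx11_abi())   # expect True
print(torch.cuda.is_available())         # requires a visible GPU
"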
2025-05-07T20:28:49.2328320Z PyTorch version: 2.8.0.dev20250507+cu128 2025-05-07T20:28:49.2328654Z Is debug build: False 2025-05-07T20:28:49.2328913Z CUDA used to build PyTorch: 12.8 2025-05-07T20:28:49.2329213Z ROCM used to build PyTorch: N/A 2025-05-07T20:28:49.2329395Z 2025-05-07T20:28:49.2329526Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:28:49.2329870Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:28:49.2330335Z Clang version: Could not collect 2025-05-07T20:28:49.2330761Z CMake version: Could not collect 2025-05-07T20:28:49.2331100Z Libc version: glibc-2.34 2025-05-07T20:28:49.2331268Z 2025-05-07T20:28:49.2331707Z Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] (64-bit runtime) 2025-05-07T20:28:49.2332591Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:28:49.2333150Z Is CUDA available: True 2025-05-07T20:28:49.2333424Z CUDA runtime version: 12.8.61 2025-05-07T20:28:49.2333704Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:28:49.2334034Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:28:49.2334380Z Nvidia driver version: 570.133.07 2025-05-07T20:28:49.2334676Z cuDNN version: Could not collect 2025-05-07T20:28:49.2334955Z HIP runtime version: N/A 2025-05-07T20:28:49.2335221Z MIOpen runtime version: N/A 2025-05-07T20:28:49.2335499Z Is XNNPACK available: True 2025-05-07T20:28:49.2335668Z 2025-05-07T20:28:49.2335751Z CPU: 2025-05-07T20:28:49.2335980Z Architecture: x86_64 2025-05-07T20:28:49.2336327Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:28:49.2336727Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:28:49.2337124Z Byte Order: Little Endian 2025-05-07T20:28:49.2337451Z CPU(s): 16 2025-05-07T20:28:49.2337758Z On-line CPU(s) list: 0-15 2025-05-07T20:28:49.2338484Z Vendor ID: AuthenticAMD 2025-05-07T20:28:49.2338842Z Model name: AMD EPYC 7R32 2025-05-07T20:28:49.2339174Z CPU family: 23 2025-05-07T20:28:49.2339464Z Model: 49 2025-05-07T20:28:49.2339762Z Thread(s) per core: 2 2025-05-07T20:28:49.2340061Z Core(s) per socket: 8 2025-05-07T20:28:49.2340350Z Socket(s): 1 2025-05-07T20:28:49.2340643Z Stepping: 0 2025-05-07T20:28:49.2341111Z BogoMIPS: 5600.08 2025-05-07T20:28:49.2343252Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:28:49.2345372Z Hypervisor vendor: KVM 2025-05-07T20:28:49.2345698Z Virtualization type: full 2025-05-07T20:28:49.2346044Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:28:49.2346425Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:28:49.2346808Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:28:49.2347173Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:28:49.2347506Z NUMA node(s): 1 2025-05-07T20:28:49.2347814Z NUMA node0 CPU(s): 0-15 2025-05-07T20:28:49.2348157Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:28:49.2348549Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:28:49.2348924Z Vulnerability L1tf: Not affected 2025-05-07T20:28:49.2349290Z 
Vulnerability Mds: Not affected 2025-05-07T20:28:49.2349650Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:49.2350022Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:49.2350405Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:49.2350965Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:49.2351573Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:49.2352138Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:49.2352853Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:49.2353909Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:49.2354605Z Vulnerability Srbds: Not affected 2025-05-07T20:28:49.2354980Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:49.2355218Z 2025-05-07T20:28:49.2355325Z Versions of relevant libraries: 2025-05-07T20:28:49.2355606Z [pip3] numpy==2.2.5 2025-05-07T20:28:49.2355860Z [pip3] nvidia-cublas-cu12==12.8.3.14 2025-05-07T20:28:49.2356177Z [pip3] nvidia-cuda-cupti-cu12==12.8.57 2025-05-07T20:28:49.2356492Z [pip3] nvidia-cuda-nvrtc-cu12==12.8.61 2025-05-07T20:28:49.2356824Z [pip3] nvidia-cuda-runtime-cu12==12.8.57 2025-05-07T20:28:49.2357149Z [pip3] nvidia-cudnn-cu12==9.8.0.87 2025-05-07T20:28:49.2357443Z [pip3] nvidia-cufft-cu12==11.3.3.41 2025-05-07T20:28:49.2357741Z [pip3] nvidia-curand-cu12==10.3.9.55 2025-05-07T20:28:49.2358050Z [pip3] nvidia-cusolver-cu12==11.7.2.55 2025-05-07T20:28:49.2358358Z [pip3] nvidia-cusparse-cu12==12.5.7.53 2025-05-07T20:28:49.2358809Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:49.2359121Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:49.2359410Z [pip3] nvidia-nvjitlink-cu12==12.8.61 2025-05-07T20:28:49.2359718Z [pip3] nvidia-nvtx-cu12==12.8.55 2025-05-07T20:28:49.2360018Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:49.2360325Z [pip3] torch==2.8.0.dev20250507+cu128 2025-05-07T20:28:49.2360711Z [conda] cuda-cudart 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:49.2361206Z [conda] cuda-cudart-dev 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:49.2361811Z [conda] cuda-cudart-dev_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:49.2362341Z [conda] cuda-cudart-static 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:49.2362887Z [conda] cuda-cudart-static_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:49.2363429Z [conda] cuda-cudart_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:49.2363930Z [conda] cuda-cupti 12.8.57 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.2364403Z [conda] cuda-cupti-dev 12.8.57 h5888daf_0 conda-forge 2025-05-07T20:28:49.2364896Z [conda] cuda-libraries 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:28:49.2365407Z [conda] cuda-libraries-dev 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:28:49.2365896Z [conda] cuda-nvrtc 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.2366379Z [conda] cuda-nvrtc-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:28:49.2366851Z [conda] cuda-nvtx 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.2367318Z [conda] cuda-opencl 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.2367803Z [conda] cuda-opencl-dev 12.8.55 h5888daf_0 conda-forge 2025-05-07T20:28:49.2368296Z [conda] cuda-runtime 12.8.0 ha804496_0 conda-forge 2025-05-07T20:28:49.2368775Z [conda] libcublas 12.8.3.14 h9ab20c4_0 
conda-forge 2025-05-07T20:28:49.2369252Z [conda] libcublas-dev 12.8.3.14 h9ab20c4_0 conda-forge 2025-05-07T20:28:49.2369733Z [conda] libcufft 11.3.3.41 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.2370206Z [conda] libcufft-dev 11.3.3.41 h5888daf_0 conda-forge 2025-05-07T20:28:49.2370682Z [conda] libcurand 10.3.9.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.2371161Z [conda] libcurand-dev 10.3.9.55 h5888daf_0 conda-forge 2025-05-07T20:28:49.2371651Z [conda] libcusolver 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:28:49.2372149Z [conda] libcusolver-dev 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:28:49.2372647Z [conda] libcusparse 12.5.7.53 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.2373145Z [conda] libcusparse-dev 12.5.7.53 h5888daf_0 conda-forge 2025-05-07T20:28:49.2373646Z [conda] libnvjitlink 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.2374148Z [conda] libnvjitlink-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:28:49.2374620Z [conda] numpy 2.2.5 py310hefbff90_0 conda-forge 2025-05-07T20:28:49.2375097Z [conda] nvidia-cublas-cu12 12.8.3.14 pypi_0 pypi 2025-05-07T20:28:49.2375612Z [conda] nvidia-cuda-cupti-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:28:49.2376125Z [conda] nvidia-cuda-nvrtc-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:28:49.2376636Z [conda] nvidia-cuda-runtime-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:28:49.2377142Z [conda] nvidia-cudnn-cu12 9.8.0.87 pypi_0 pypi 2025-05-07T20:28:49.2377741Z [conda] nvidia-cufft-cu12 11.3.3.41 pypi_0 pypi 2025-05-07T20:28:49.2378226Z [conda] nvidia-curand-cu12 10.3.9.55 pypi_0 pypi 2025-05-07T20:28:49.2378723Z [conda] nvidia-cusolver-cu12 11.7.2.55 pypi_0 pypi 2025-05-07T20:28:49.2379233Z [conda] nvidia-cusparse-cu12 12.5.7.53 pypi_0 pypi 2025-05-07T20:28:49.2379752Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:49.2380245Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:49.2380822Z [conda] nvidia-nvjitlink-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:28:49.2381313Z [conda] nvidia-nvtx-cu12 12.8.55 pypi_0 pypi 2025-05-07T20:28:49.2381795Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:49.2382269Z [conda] torch 2.8.0.dev20250507+cu128 pypi_0 pypi 2025-05-07T20:28:49.2382555Z 2025-05-07T20:28:49.3085385Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:49.3086082Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:49.3098207Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:49.3098570Z env: 2025-05-07T20:28:49.3098811Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:49.3099123Z BUILD_ENV: build_binary 2025-05-07T20:28:49.3099384Z BUILD_TARGET: genai 2025-05-07T20:28:49.3099627Z BUILD_VARIANT: cuda 2025-05-07T20:28:49.3099886Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:28:49.3100158Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:49.3100477Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:49.3100819Z ##[endgroup] 2025-05-07T20:28:49.6503704Z ################################################################################ 2025-05-07T20:28:49.6504107Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:49.6504361Z # 2025-05-07T20:28:49.6524806Z # [2025-05-07T20:28:49.652Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:49.6525244Z ################################################################################ 2025-05-07T20:28:49.6535551Z 2025-05-07T20:28:49.6542525Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:49.7464965Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:49.7484653Z [BUILD] Running git submodules update ... 2025-05-07T20:28:49.7503941Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:49.7869852Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:49.7870344Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:49.7870808Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:49.7871220Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:49.7871652Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:49.7872113Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:49.7872541Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:49.7905664Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:49.8462248Z [BUILD] Installing other build dependencies ... 
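The submodule refresh above is the standard two-step pattern; a minimal sketch of running the same steps in a local FBGEMM checkout:

cd FBGEMM
git submodule sync                        # re-read submodule URLs from .gitmodules
git submodule update --init --recursive   # check out pinned commits, including nested submodules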
2025-05-07T20:28:49.8484541Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:52.2857811Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:52.3243451Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:52.4293677Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:52.4333957Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:52.6758390Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:52.6792926Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:52.7868491Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:52.7895161Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:53.1657621Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:53.1715962Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:53.2263618Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:53.2267651Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:53.3067611Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:53.3117998Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:53.3522442Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:28:53.4061039Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:53.4106723Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:53.5433616Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:53.5462589Z Downloading PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:53.6515872Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:53.6552744Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:53.7083701Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:53.7737393Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:53.7775758Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:53.8774033Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:53.8800589Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:53.9895786Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:53.9932293Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:54.1013323Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:54.1056402Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:54.2035276Z Collecting pyproject_hooks (from build->-r requirements.txt 
(line 14)) 2025-05-07T20:28:54.2063400Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:54.3098591Z Collecting tomli>=1.1.0 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:54.3174679Z Downloading tomli-2.2.1-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:54.4244914Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:54.4273648Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:54.5618319Z Collecting exceptiongroup>=1.0.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:54.5646818Z Downloading exceptiongroup-1.2.2-py3-none-any.whl.metadata (6.6 kB) 2025-05-07T20:28:54.6623661Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:54.6652803Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:54.7157647Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:54.7678444Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:54.7708660Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:54.8200698Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:54.8738885Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:54.8768384Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:54.9268695Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:54.9909664Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:54.9935827Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:55.0446175Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:55.0932252Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:55.1460035Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:55.6659298Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 53.6 MB/s eta 0:00:00 2025-05-07T20:28:55.6700694Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:55.7213237Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:55.7708110Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:55.8250538Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:55.8910745Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:55.9429581Z Downloading PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (751 kB) 2025-05-07T20:28:56.0058335Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 751.2/751.2 kB 8.1 MB/s eta 0:00:00 2025-05-07T20:28:56.0097124Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:56.0621819Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:56.1102748Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:56.1581055Z 
Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:56.2159239Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:56.2642803Z Downloading exceptiongroup-1.2.2-py3-none-any.whl (16 kB) 2025-05-07T20:28:56.3074837Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:56.3592343Z Downloading tomli-2.2.1-py3-none-any.whl (14 kB) 2025-05-07T20:28:56.4121805Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:56.4603540Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:56.5134870Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:56.5654433Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:56.8031199Z Installing collected packages: sortedcontainers, tomli, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, exceptiongroup, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:28:59.1755912Z 2025-05-07T20:28:59.1830474Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 exceptiongroup-1.2.2 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 tomli-2.2.1 typing-inspect-0.9.0 2025-05-07T20:28:59.3649962Z ################################################################################ 2025-05-07T20:28:59.3650508Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:59.3650896Z # 2025-05-07T20:28:59.3667391Z # [2025-05-07T20:28:59.366Z] + install_triton_pip build_binary 2025-05-07T20:28:59.3667987Z ################################################################################ 2025-05-07T20:28:59.3668341Z 2025-05-07T20:28:59.3668704Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:59.3669362Z ################################################################################ 2025-05-07T20:28:59.3669898Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:59.3670359Z # 2025-05-07T20:28:59.3688278Z # [2025-05-07T20:28:59.368Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:59.3688828Z ################################################################################ 2025-05-07T20:28:59.3689055Z 2025-05-07T20:28:59.3706510Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:59.4648164Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:59.4648973Z ################################################################################ 2025-05-07T20:28:59.4649451Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:59.4649844Z # 2025-05-07T20:28:59.4668708Z # [2025-05-07T20:28:59.466Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:59.4669224Z ################################################################################ 2025-05-07T20:28:59.4669509Z 2025-05-07T20:28:59.4716414Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:59.4733516Z [INSTALL] Using a non-RELEASE channel: nightly ... 
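[NOTE] A minimal Python sketch of how a spec like "nightly/3.2.0+git4b3bb1f8" appears to expand into pip arguments, based on the [INSTALL] lines around it. This is an assumed reconstruction for illustration only; the real logic lives in .github/scripts/setup_env.bash.

def prepare_pip_arguments(package: str, spec: str) -> list[str]:
    # Split "<channel>/<version>"; a bare version implies the release channel.
    channel, _, version = spec.rpartition("/")
    channel = channel or "release"
    base = "https://download.pytorch.org/whl/"
    index_url = base if channel == "release" else f"{base}{channel}/"
    args = ["pip", "install"]
    if channel != "release":
        args.append("--pre")  # nightly/test channels ship pre-release wheels
    args += [f"{package}=={version}", "--index-url", index_url]
    return args

print(prepare_pip_arguments("pytorch-triton", "nightly/3.2.0+git4b3bb1f8"))
# ['pip', 'install', '--pre', 'pytorch-triton==3.2.0+git4b3bb1f8',
#  '--index-url', 'https://download.pytorch.org/whl/nightly/']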
2025-05-07T20:28:59.4734041Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:59.4742330Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:59.4751825Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:59.4772814Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:05.3193409Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. 2025-05-07T20:29:05.3194683Z torch 2.8.0.dev20250507+cu128 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:29:05.3195365Z 2025-05-07T20:29:05.3195583Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:05.3196017Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:29:05.3196842Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:29:05.3198107Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:29:05.3199213Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 174.1 MB/s eta 0:00:00 2025-05-07T20:29:05.3199617Z Installing collected packages: pytorch-triton 2025-05-07T20:29:05.3199971Z Attempting uninstall: pytorch-triton 2025-05-07T20:29:05.3200380Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:29:05.3200823Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:29:05.3201264Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:29:05.3201722Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:29:05.3201989Z 2025-05-07T20:29:07.5651365Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:29:07.5655259Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:29:09.8700349Z ################################################################################ 2025-05-07T20:29:09.8700862Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:29:09.8701276Z ################################################################################ 2025-05-07T20:29:09.8701528Z 2025-05-07T20:29:12.0479971Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:29:14.1904018Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:29:14.1907428Z [BUILD] Successfully ran git submodules update 2025-05-07T20:29:14.1945460Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:14.1946343Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:14.1959097Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:14.1959574Z env: 2025-05-07T20:29:14.1959904Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:14.1960395Z BUILD_ENV: build_binary 2025-05-07T20:29:14.1960746Z BUILD_TARGET: genai 2025-05-07T20:29:14.1961059Z BUILD_VARIANT: cuda 2025-05-07T20:29:14.1961551Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:29:14.1961915Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:14.1962385Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:14.1962811Z ##[endgroup] 2025-05-07T20:29:14.5406129Z ################################################################################ 2025-05-07T20:29:14.5406670Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:29:14.5407016Z # 2025-05-07T20:29:14.5423221Z # [2025-05-07T20:29:14.541Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.5424338Z ################################################################################ 2025-05-07T20:29:14.5424621Z 2025-05-07T20:29:14.5425117Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.5425963Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.5426382Z 2025-05-07T20:29:14.5587624Z 2bed2d996c113b97194d809bcd57307f8de8d387 fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.5590975Z 2025-05-07T20:29:14.5591757Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.5592194Z 2025-05-07T20:29:14.5770727Z 4888273ec0852f505fccc81faa23a2d37bf7d3b8624276cf783c626cc6938b65 fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.5773772Z 2025-05-07T20:29:14.5774666Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.5775240Z 2025-05-07T20:29:14.6119206Z 8884054067b6c5891f141d668bcfc919 fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.6122929Z 2025-05-07T20:29:14.6134685Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl ... 2025-05-07T20:29:14.6157501Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:29:17.6395764Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:29:17.6396951Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:29:17.6397978Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:29:17.6398786Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:29:17.6399137Z 2025-05-07T20:29:24.8613133Z ################################################################################ 2025-05-07T20:29:24.8613813Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:29:24.8614376Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu128 2025-05-07T20:29:24.8614973Z [CHECK] CUDA version reported by PyTorch is: 12.8 2025-05-07T20:29:24.8615471Z [CHECK] 2025-05-07T20:29:24.8615913Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU 2025-05-07T20:29:24.8616563Z [CHECK] package channel; the package may be broken at runtime!!! 2025-05-07T20:29:24.8617140Z ################################################################################ 2025-05-07T20:29:24.8617408Z 2025-05-07T20:29:24.8617596Z [INSTALL] Checking imports and symbols ... 2025-05-07T20:29:29.0232342Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:29:33.0780010Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:37.2341482Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:37.2344578Z [CHECK] Printing out the FBGEMM-GPU version ... 2025-05-07T20:29:49.5263742Z ################################################################################ 2025-05-07T20:29:49.5264511Z [CHECK] The installed FBGEMM TARGET is: genai 2025-05-07T20:29:49.5264994Z [CHECK] The installed FBGEMM VARIANT is: cuda 2025-05-07T20:29:49.5265477Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7 2025-05-07T20:29:49.5266019Z ################################################################################ 2025-05-07T20:29:49.5266334Z 2025-05-07T20:29:57.7266216Z ################################################################################ 2025-05-07T20:29:57.7267450Z [CHECK] FBGEMM_GPU Experimental Packages 2025-05-07T20:29:57.7270308Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils'] 2025-05-07T20:29:57.7273565Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 2025-05-07T20:29:57.7274237Z ################################################################################ 2025-05-07T20:29:57.7274558Z 2025-05-07T20:29:57.7274798Z [INSTALL] Check for installation of Python sources ... 2025-05-07T20:30:01.8114798Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ... 2025-05-07T20:30:05.9026075Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ... 2025-05-07T20:30:10.1065925Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ... 2025-05-07T20:30:14.2368261Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ... 2025-05-07T20:30:14.2372395Z [INSTALL] Check for operator registrations ... 
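[NOTE] The operator-registration check that follows reduces to looking each operator up on the torch.ops namespace after importing fbgemm_gpu; a minimal sketch of the idea (hypothetical helper, not the actual setup_env.bash code):

import torch
import fbgemm_gpu  # noqa: F401  -- the import loads the libraries that register the ops

def check_operator_registered(op_name: str) -> None:
    # Attribute lookup on torch.ops.fbgemm raises AttributeError if no
    # loaded library has registered an operator under that name.
    getattr(torch.ops.fbgemm, op_name)
    print(f"[CHECK] FBGEMM_GPU operator appears to be correctly registered: "
          f"torch.ops.fbgemm.{op_name}")

for name in ("nccl_init", "gqa_attn_splitk", "rope_qkv_decoding"):
    check_operator_registered(name)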
2025-05-07T20:30:18.2108820Z fbgemm.nccl_init 2025-05-07T20:30:18.2109037Z 2025-05-07T20:30:18.2747240Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:30:22.3698908Z fbgemm.gqa_attn_splitk 2025-05-07T20:30:22.3699199Z 2025-05-07T20:30:22.4344772Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:30:26.4110366Z fbgemm.rope_qkv_decoding 2025-05-07T20:30:26.4111176Z 2025-05-07T20:30:26.4741340Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:30:26.4742094Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:30:26.4794186Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:26.4794704Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:26.4812055Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:30:26.4812479Z env: 2025-05-07T20:30:26.4812730Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:30:26.4813060Z BUILD_ENV: build_binary 2025-05-07T20:30:26.4813320Z BUILD_TARGET: genai 2025-05-07T20:30:26.4813570Z BUILD_VARIANT: cuda 2025-05-07T20:30:26.4813828Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:30:26.4814099Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:30:26.4814426Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:30:26.4814787Z ##[endgroup] 2025-05-07T20:30:26.8249085Z ################################################################################ 2025-05-07T20:30:26.8249462Z # Test All FBGEMM-GPU Modules 2025-05-07T20:30:26.8249819Z # 2025-05-07T20:30:26.8264569Z # [2025-05-07T20:30:26.826Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:30:26.8265002Z ################################################################################ 2025-05-07T20:30:26.8265227Z 2025-05-07T20:30:35.0138671Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:30:35.0139838Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:30:35.0140293Z [TEST] Determined the test directories: 2025-05-07T20:30:35.0140654Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:30:35.0140998Z fbgemm_gpu/experimental/example/test 2025-05-07T20:30:35.0141332Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:30:35.0141552Z 2025-05-07T20:30:35.0147609Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:30:35.0154569Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:30:35.0155213Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:30:35.0155535Z 2025-05-07T20:30:35.4767844Z 2025-05-07T20:30:35.4768149Z [TEST] Installing PyTest ... 
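[NOTE] The "[EXEC] [ATTEMPT n/3]" lines here and earlier come from retrying network-bound commands; a minimal Python rendering of the pattern for illustration (the actual retry helper is a bash function in setup_env.bash):

import subprocess
import time

def exec_with_retries(cmd: list[str], max_retries: int = 3, delay_s: int = 5) -> None:
    # Re-run the command on failure, up to max_retries extra attempts,
    # sleeping briefly between attempts to ride out transient failures.
    for attempt in range(max_retries + 1):
        print(f"[EXEC] [ATTEMPT {attempt}/{max_retries}] + {' '.join(cmd)}")
        try:
            subprocess.run(cmd, check=True)
            return
        except subprocess.CalledProcessError:
            if attempt == max_retries:
                raise
            time.sleep(delay_s)

exec_with_retries(["conda", "install", "-n", "build_binary", "-c", "conda-forge",
                   "--override-channels", "-y", "pytest", "expecttest"])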
2025-05-07T20:30:35.4797299Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest
2025-05-07T20:30:36.6657437Z Channels:
2025-05-07T20:30:36.6657778Z  - conda-forge
2025-05-07T20:30:36.6658146Z Platform: linux-64
2025-05-07T20:30:40.1118950Z Collecting package metadata (repodata.json): done
2025-05-07T20:30:41.2665383Z Solving environment: done
2025-05-07T20:30:41.4994563Z ## Package Plan ##
2025-05-07T20:30:41.4995156Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:30:41.4995599Z   added / updated specs:
2025-05-07T20:30:41.4995868Z     - expecttest
2025-05-07T20:30:41.4996105Z     - pytest
2025-05-07T20:30:41.4996437Z The following packages will be downloaded:
2025-05-07T20:30:41.4996882Z     package                    |            build
2025-05-07T20:30:41.4997367Z     ---------------------------|-----------------
2025-05-07T20:30:41.4997921Z     colorama-0.4.6             |     pyhd8ed1ab_1          26 KB  conda-forge
2025-05-07T20:30:41.4998498Z     exceptiongroup-1.2.2       |     pyhd8ed1ab_1          20 KB  conda-forge
2025-05-07T20:30:41.4998993Z     expecttest-0.3.0           |     pyhd8ed1ab_0          14 KB  conda-forge
2025-05-07T20:30:41.4999467Z     iniconfig-2.0.0            |     pyhd8ed1ab_1          11 KB  conda-forge
2025-05-07T20:30:41.4999935Z     packaging-25.0             |     pyh29332c3_1          61 KB  conda-forge
2025-05-07T20:30:41.5000382Z     pluggy-1.5.0               |     pyhd8ed1ab_1          23 KB  conda-forge
2025-05-07T20:30:41.5000823Z     pytest-8.3.5               |     pyhd8ed1ab_0         254 KB  conda-forge
2025-05-07T20:30:41.5001648Z     tomli-2.2.1                |     pyhd8ed1ab_1          19 KB  conda-forge
2025-05-07T20:30:41.5002158Z     ------------------------------------------------------------
2025-05-07T20:30:41.5002589Z                                            Total:         428 KB
2025-05-07T20:30:41.5003088Z The following NEW packages will be INSTALLED:
2025-05-07T20:30:41.5003543Z   colorama           conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1
2025-05-07T20:30:41.5004080Z   exceptiongroup     conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1
2025-05-07T20:30:41.5004628Z   expecttest         conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0
2025-05-07T20:30:41.5005135Z   iniconfig          conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1
2025-05-07T20:30:41.5005632Z   packaging          conda-forge/noarch::packaging-25.0-pyh29332c3_1
2025-05-07T20:30:41.5006104Z   pluggy             conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1
2025-05-07T20:30:41.5006572Z   pytest             conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0
2025-05-07T20:30:41.5007024Z   tomli              conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1
2025-05-07T20:30:41.5007473Z Downloading and Extracting Packages: ...working...
2025-05-07T20:30:41.6816410Z exceptiongroup-1.2.2 | 20 KB  | ########## | 100%
2025-05-07T20:30:41.8112630Z tomli-2.2.1          | 19 KB  | ########## | 100%
2025-05-07T20:30:41.8118303Z expecttest-0.3.0     | 14 KB  | ########## | 100%
2025-05-07T20:30:41.8805752Z colorama-0.4.6       | 26 KB  | ########## | 100%
2025-05-07T20:30:41.8864414Z iniconfig-2.0.0      | 11 KB  | ########## | 100%
2025-05-07T20:30:41.9069303Z pytest-8.3.5         | 254 KB | ########## | 100%
2025-05-07T20:30:41.9488672Z pluggy-1.5.0         | 23 KB  | ########## | 100%
2025-05-07T20:30:41.9550705Z packaging-25.0       | 61 KB  | ########## | 100%
2025-05-07T20:30:41.9995136Z done
2025-05-07T20:30:42.1002559Z Preparing transaction: done
2025-05-07T20:30:42.2004374Z Verifying transaction: done
2025-05-07T20:30:44.1034511Z Executing transaction: done
2025-05-07T20:30:44.2399329Z [TEST] Checking imports ...
2025-05-07T20:30:48.4029067Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
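[NOTE] The "[CHECK] Python (sub-)package ... found" lines amount to an import probe run inside the conda environment; a minimal sketch (hypothetical helper, not the script's actual code):

import importlib

def check_python_package(name: str) -> bool:
    # A real import both finds the package and exercises its import-time
    # dependencies, so it is a stronger check than a path lookup.
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    print(f"[CHECK] Python (sub-)package '{name}' found ...")
    return True

for pkg in ("fbgemm_gpu", "fbgemm_gpu.config", "fbgemm_gpu.docs"):
    check_python_package(pkg)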
2025-05-07T20:30:48.4041182Z [TEST] Setting feature flags ... 2025-05-07T20:30:48.4041635Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:30:48.4041979Z 2025-05-07T20:30:48.8267672Z 2025-05-07T20:30:48.8269206Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:30:48.8270041Z ################################################################################ 2025-05-07T20:30:48.8270380Z # Run FBGEMM-GPU Tests: 2025-05-07T20:30:48.8270637Z # 2025-05-07T20:30:48.8290223Z # [2025-05-07T20:30:48.828Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:30:48.8290826Z ################################################################################ 2025-05-07T20:30:48.8291057Z 2025-05-07T20:30:48.8298330Z [TEST] Enumerating ALL test files ... 2025-05-07T20:30:48.8328343Z ./attention/gqa_test.py 2025-05-07T20:30:48.8328759Z ./coalesce/coalesce_test.py 2025-05-07T20:30:48.8329068Z ./comm/multi_gpu_car_test.py 2025-05-07T20:30:48.8329366Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:30:48.8329674Z ./kv_cache/kv_cache_test.py 2025-05-07T20:30:48.8329943Z ./moe/activation_test.py 2025-05-07T20:30:48.8330208Z ./moe/gather_scatter_test.py 2025-05-07T20:30:48.8330466Z ./moe/layers_test.py 2025-05-07T20:30:48.8330707Z ./moe/shuffling_test.py 2025-05-07T20:30:48.8330970Z ./quantize/quantize_test.py 2025-05-07T20:30:48.8331141Z 2025-05-07T20:30:48.8331267Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:30:48.8331485Z 2025-05-07T20:30:48.8350595Z ################################################################################ 2025-05-07T20:30:48.8366850Z # [2025-05-07T20:30:48.836Z] Run Python Test Suite: 2025-05-07T20:30:48.8367206Z # ./attention/gqa_test.py 2025-05-07T20:30:48.8367495Z ################################################################################ 2025-05-07T20:30:48.8391990Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:30:48.8392871Z 2025-05-07T20:30:51.4582763Z ============================= test session starts ============================== 2025-05-07T20:30:51.4583991Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:51.4585446Z cachedir: .pytest_cache 2025-05-07T20:30:51.4586589Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:51.4588016Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:51.4588832Z plugins: hypothesis-6.131.14 2025-05-07T20:30:53.1146436Z collecting ... 
collected 2 items

2025-05-07T20:31:30.7769317Z attention/gqa_test.py::Int4GQATest::test_gqa
Trying example: test_gqa(int4_kv=False, num_groups=1, B=1, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=1, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=23, MAX_T=33, N_H_L=68)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=52, N_H_L=67)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=57, MAX_T=45, N_H_L=120)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=52, MAX_T=42, N_H_L=53)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=77, MAX_T=95, N_H_L=53)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=113, MAX_T=48, N_H_L=96)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=51, MAX_T=61, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=113, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=65, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=65, MAX_T=65, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=108, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=14, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=6)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=70, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=78, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=94)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=41, MAX_T=105, N_H_L=126)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=126)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=105)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=95, MAX_T=114, N_H_L=43)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=114, N_H_L=43)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=43, N_H_L=43)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=21, MAX_T=38, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=38, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=42, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=42, MAX_T=42, N_H_L=42)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=74, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=15, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=15, N_H_L=15)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=104, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=117, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=69, MAX_T=117, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=69, N_H_L=69)
2025-05-07T20:31:30.7872540Z PASSED
2025-05-07T20:31:30.8151759Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...)
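[NOTE] The "Trying example" blocks above are Hypothesis printing each generated parameter set under the 'ci' profile shown in the session header (database=None, deadline=None, print_blob=True, derandomize=True). A minimal sketch of registering such a profile, assuming the settings the header reports:

from hypothesis import HealthCheck, settings

# Mirror the profile the pytest session header prints: no example database,
# no per-example deadline, deterministic (derandomized) generation, and the
# too_slow health check suppressed for slow CI runners.
settings.register_profile(
    "ci",
    database=None,
    deadline=None,
    print_blob=True,
    derandomize=True,
    suppress_health_check=(HealthCheck.too_slow,),
)
settings.load_profile("ci")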
2025-05-07T20:31:30.8152145Z 2025-05-07T20:31:30.8152322Z =========================== short test summary info ============================ 2025-05-07T20:31:30.8153327Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/unittest/case.py:117: Skip when CUDA is not available or xformers is not available 2025-05-07T20:31:30.8154184Z ======================== 1 passed, 1 skipped in 39.91s ========================= 2025-05-07T20:31:31.4980622Z 2025-05-07T20:31:31.4981504Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:31:31.5004634Z [TEST] Python test time for ./attention/gqa_test.py: 43 seconds 2025-05-07T20:31:31.5005059Z 2025-05-07T20:31:31.5005063Z 2025-05-07T20:31:31.5005068Z 2025-05-07T20:31:31.5005072Z 2025-05-07T20:31:31.5027290Z ################################################################################ 2025-05-07T20:31:31.5044132Z # [2025-05-07T20:31:31.504Z] Run Python Test Suite: 2025-05-07T20:31:31.5044553Z # ./coalesce/coalesce_test.py 2025-05-07T20:31:31.5045010Z ################################################################################ 2025-05-07T20:31:31.5072177Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:31:31.5072873Z 2025-05-07T20:31:33.8148117Z ============================= test session starts ============================== 2025-05-07T20:31:33.8148997Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:33.8149979Z cachedir: .pytest_cache 2025-05-07T20:31:33.8150637Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:33.8151447Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:33.8151908Z plugins: hypothesis-6.131.14 2025-05-07T20:31:35.4322938Z collecting ... 
collected 1 item 2025-05-07T20:31:35.4323261Z 2025-05-07T20:31:36.1698325Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:31:36.1698802Z 2025-05-07T20:31:36.1699018Z ============================== 1 passed in 2.49s =============================== 2025-05-07T20:31:36.7994457Z 2025-05-07T20:31:36.7995126Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:31:36.8015541Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds 2025-05-07T20:31:36.8015867Z 2025-05-07T20:31:36.8015872Z 2025-05-07T20:31:36.8015876Z 2025-05-07T20:31:36.8015880Z 2025-05-07T20:31:36.8036264Z ################################################################################ 2025-05-07T20:31:36.8050860Z # [2025-05-07T20:31:36.804Z] Run Python Test Suite: 2025-05-07T20:31:36.8052998Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:31:36.8053456Z ################################################################################ 2025-05-07T20:31:36.8079131Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:31:36.8079772Z 2025-05-07T20:31:39.0456437Z ============================= test session starts ============================== 2025-05-07T20:31:39.0457204Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:39.0457795Z cachedir: .pytest_cache 2025-05-07T20:31:39.0458461Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:39.0459285Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:39.0459749Z plugins: hypothesis-6.131.14 2025-05-07T20:31:40.7445514Z collecting ... 
collected 5 items 2025-05-07T20:31:40.7445861Z 2025-05-07T20:31:40.7456957Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:31:40.7465731Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:31:40.7473782Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:31:40.7482313Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:31:40.7498419Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:31:40.7498932Z 2025-05-07T20:31:40.7499115Z =========================== short test summary info ============================ 2025-05-07T20:31:40.7499885Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:40.7500919Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:40.7501946Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:40.7502967Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:40.7503987Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:40.7504758Z ============================== 5 skipped in 1.84s ============================== 2025-05-07T20:31:41.3406142Z 2025-05-07T20:31:41.3407065Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:31:41.3428301Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 5 seconds 2025-05-07T20:31:41.3428653Z 2025-05-07T20:31:41.3428658Z 2025-05-07T20:31:41.3428662Z 2025-05-07T20:31:41.3428666Z 2025-05-07T20:31:41.3452519Z ################################################################################ 2025-05-07T20:31:41.3469302Z # [2025-05-07T20:31:41.346Z] Run Python Test Suite: 2025-05-07T20:31:41.3469690Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:41.3470038Z ################################################################################ 2025-05-07T20:31:41.3495848Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:41.3496578Z 2025-05-07T20:31:43.5647051Z ============================= test session starts ============================== 2025-05-07T20:31:43.5647720Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:43.5648268Z cachedir: .pytest_cache 2025-05-07T20:31:43.5648883Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:43.5649638Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:43.5650074Z plugins: hypothesis-6.131.14 2025-05-07T20:31:45.2417642Z collecting ... 
collected 2 items 2025-05-07T20:31:45.2417928Z 2025-05-07T20:31:45.2429543Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:31:45.2444050Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:31:45.2444510Z 2025-05-07T20:31:45.2444669Z =========================== short test summary info ============================ 2025-05-07T20:31:45.2445345Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:45.2446220Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:45.2446846Z ============================== 2 skipped in 1.80s ============================== 2025-05-07T20:31:45.8417132Z 2025-05-07T20:31:45.8417793Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:45.8439764Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 4 seconds 2025-05-07T20:31:45.8440128Z 2025-05-07T20:31:45.8440133Z 2025-05-07T20:31:45.8440158Z 2025-05-07T20:31:45.8440568Z 2025-05-07T20:31:45.8464438Z ################################################################################ 2025-05-07T20:31:45.8481969Z # [2025-05-07T20:31:45.847Z] Run Python Test Suite: 2025-05-07T20:31:45.8482362Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:31:45.8482690Z ################################################################################ 2025-05-07T20:31:45.8509538Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:31:45.8510218Z 2025-05-07T20:31:48.1650038Z ============================= test session starts ============================== 2025-05-07T20:31:48.1650739Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:48.1651319Z cachedir: .pytest_cache 2025-05-07T20:31:48.1651983Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:48.1652789Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:48.1653238Z plugins: hypothesis-6.131.14 2025-05-07T20:31:49.8498467Z collecting ... collected 4 items 2025-05-07T20:31:49.8498929Z 2025-05-07T20:31:52.7574664Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
2025-05-07T20:31:52.7708076Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:31:52.7866940Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:31:52.8001076Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:31:52.8001539Z 2025-05-07T20:31:52.8001707Z =========================== short test summary info ============================ 2025-05-07T20:31:52.8002462Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/unittest/case.py:117: Skip when H100 is not available or MI300 is not available 2025-05-07T20:31:52.8003422Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/unittest/case.py:117: Skip when xformers is not available 2025-05-07T20:31:52.8004073Z ============================== 4 skipped in 4.77s ============================== 2025-05-07T20:31:54.7236350Z 2025-05-07T20:31:54.7237077Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:31:54.7259416Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 9 seconds 2025-05-07T20:31:54.7259751Z 2025-05-07T20:31:54.7259764Z 2025-05-07T20:31:54.7259769Z 2025-05-07T20:31:54.7259773Z 2025-05-07T20:31:54.7282167Z ################################################################################ 2025-05-07T20:31:54.7299118Z # [2025-05-07T20:31:54.729Z] Run Python Test Suite: 2025-05-07T20:31:54.7299501Z # ./moe/activation_test.py 2025-05-07T20:31:54.7299829Z ################################################################################ 2025-05-07T20:31:54.7328312Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:31:54.7328993Z 2025-05-07T20:31:57.0591067Z ============================= test session starts ============================== 2025-05-07T20:31:57.0591810Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:57.0592393Z cachedir: .pytest_cache 2025-05-07T20:31:57.0593044Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:57.0593950Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:57.0594405Z plugins: hypothesis-6.131.14 2025-05-07T20:31:58.8020434Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:58.9843083Z collecting ... 
2025-05-07T20:31:54.7282167Z ################################################################################
2025-05-07T20:31:54.7299118Z # [2025-05-07T20:31:54.729Z] Run Python Test Suite:
2025-05-07T20:31:54.7299501Z # ./moe/activation_test.py
2025-05-07T20:31:54.7299829Z ################################################################################
2025-05-07T20:31:54.7328312Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py
2025-05-07T20:31:57.0591067Z ============================= test session starts ==============================
2025-05-07T20:31:57.0591810Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:57.0592393Z cachedir: .pytest_cache
2025-05-07T20:31:57.0593044Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:57.0593950Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:57.0594405Z plugins: hypothesis-6.131.14
2025-05-07T20:31:58.8020434Z TMA benchmarks will be running with experimental grid constant TMA descriptor.
2025-05-07T20:31:58.9843083Z collecting ... collected 2 items
2025-05-07T20:32:04.6527334Z moe/activation_test.py::ActivationTests::test_silu_mul
2025-05-07T20:32:04.6529967Z Trying 40 examples: test_silu_mul(T, D, contiguous, compiled) over every combination of T in {1, 128, 2048, 4096, 16384}, D in {5120, 7168}, contiguous in {True, False}, compiled in {True, False}
2025-05-07T20:32:04.6620742Z PASSED
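The 40 examples tried above are exactly the cross-product of the sampled_from domains in the test's @given decorator (visible in the failure report below); with derandomize=True, Hypothesis replays the same fixed sequence on every run. A quick offline sanity check of the count:

    from itertools import product

    # 5 values of T x 2 of D x 2 of contiguous x 2 of compiled = 40 examples.
    combos = list(product([1, 128, 2048, 4096, 16384],  # T
                          [5120, 7168],                 # D
                          [True, False],                # contiguous
                          [True, False]))               # compiled
    assert len(combos) == 40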
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:04.7259337Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:32:04.7260659Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:04.7261801Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:32:04.7263167Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:04.7264595Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:04.7265853Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:04.7267029Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:32:04.7268349Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:04.7269862Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:04.7271053Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.7272079Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.7272919Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:32:04.7274160Z W0507 20:32:04.722000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:04.7421809Z (the same identify_mutated_tensors warning, with an identical traceback, was logged three more times, at 20:32:04.741, 20:32:04.786, and 20:32:04.790)
2025-05-07T20:32:05.2770325Z moe/activation_test.py::ActivationTests::test_silu_mul_quant
2025-05-07T20:32:05.2770325Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:05.2773725Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
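The reference path fails the same way because fbgemm's triton_quantize_fp8_row itself launches a Triton kernel (_kernel_quantize_fp8_row). Conceptually, rowwise FP8 quantization computes one scale per row and casts. A rough PyTorch-only sketch of that idea, not FBGEMM's implementation (the 448.0 E4M3 bound, the eps clamp, and the exact scale_ub handling are assumptions; scale_ub is taken as a 1-element float32 tensor, as in the test). It matches the dequantization used in the test, y ≈ y_fp8.float() * scale[:, None]:

    import torch

    E4M3_MAX = 448.0  # largest finite value representable in torch.float8_e4m3fn

    def rowwise_quantize_fp8(y, scale_ub=None):
        # One scale per row, chosen so the row's max magnitude maps to E4M3_MAX.
        row_max = y.abs().amax(dim=-1, keepdim=True).float().clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / E4M3_MAX
        y_fp8 = (y.float() / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)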
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.8703965Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8705530Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.8706910Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:32:05.8708385Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.8709718Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:32:05.8710858Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:05.8711980Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:32:05.8713325Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.8714865Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.8716102Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:05.8717255Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:32:05.8718570Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.8720075Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.8721256Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.8722266Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.8723080Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:32:05.8724467Z W0507 20:32:05.865000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:06.0679815Z (the same warning was logged three more times, at 20:32:06.064, 20:32:06.604, and 20:32:06.637)
2025-05-07T20:32:07.3445918Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
(test source as above; the failure is at:)
2025-05-07T20:32:07.3456981Z
2025-05-07T20:32:07.3457225Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:07.3457720Z             op = silu_mul_quant
2025-05-07T20:32:07.3457990Z             if compiled:
2025-05-07T20:32:07.3458252Z                 op = torch.compile(op)
2025-05-07T20:32:07.3458572Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:07.3458871Z
2025-05-07T20:32:07.3459074Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:07.3459257Z
2025-05-07T20:32:07.3459363Z moe/activation_test.py:117:
2025-05-07T20:32:07.3459720Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.3460072Z moe/activation_test.py:115: in fn
2025-05-07T20:32:07.3460382Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:07.3461123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:07.3461849Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:07.3462418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:07.3463371Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:07.3464101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:07.3464671Z     kernel = self.compile(
2025-05-07T20:32:07.3465247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:07.3465945Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:07.3466362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.3466613Z
2025-05-07T20:32:07.3466839Z self =
2025-05-07T20:32:07.3467982Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:07.3469443Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76a53d89d0>}
2025-05-07T20:32:07.3470860Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:07.3471952Z context =
2025-05-07T20:32:07.3472296Z
2025-05-07T20:32:07.3472473Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:07.3473127Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:07.3473737Z                            module_map=module_map)
2025-05-07T20:32:07.3474129Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:07.3474501Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:07.3474776Z E       ^
2025-05-07T20:32:07.3475260Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:07.3475741Z
2025-05-07T20:32:07.3476178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:07.3476713Z
2025-05-07T20:32:07.3476830Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:07.3477265Z     self=,
2025-05-07T20:32:07.3477684Z     T=2048,
2025-05-07T20:32:07.3477886Z     D=5120,
2025-05-07T20:32:07.3478099Z     scale_ub=1200.0,
2025-05-07T20:32:07.3478338Z     contiguous=True,
2025-05-07T20:32:07.3478578Z     compiled=True,
2025-05-07T20:32:07.3478803Z )
2025-05-07T20:32:07.3479139Z self =
2025-05-07T20:32:07.3479820Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:07.3480148Z
2025-05-07T20:32:07.3480241Z     @given(
2025-05-07T20:32:07.3480498Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:07.3480860Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:07.3481216Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:07.3481598Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:07.3481976Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:07.3482315Z     )
2025-05-07T20:32:07.3482730Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:07.3483261Z     def test_silu_mul_quant(
2025-05-07T20:32:07.3483538Z         self,
2025-05-07T20:32:07.3483755Z         T: int,
2025-05-07T20:32:07.3483968Z         D: int,
2025-05-07T20:32:07.3484214Z         scale_ub: Optional[float],
2025-05-07T20:32:07.3484528Z         contiguous: bool,
2025-05-07T20:32:07.3484780Z         compiled: bool,
2025-05-07T20:32:07.3485022Z     ) -> None:
2025-05-07T20:32:07.3485255Z         torch.manual_seed(2025)
2025-05-07T20:32:07.3485507Z
2025-05-07T20:32:07.3485797Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:07.3486159Z
2025-05-07T20:32:07.3486362Z         x_sign = torch.sign(x)
2025-05-07T20:32:07.3486674Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:07.3487004Z         x = x_sign * x_clamp
2025-05-07T20:32:07.3487263Z         x0 = x[:, :D]
2025-05-07T20:32:07.3487489Z         x1 = x[:, D:]
2025-05-07T20:32:07.3487715Z
2025-05-07T20:32:07.3487917Z         if contiguous:
2025-05-07T20:32:07.3488164Z             x0 = x0.contiguous()
2025-05-07T20:32:07.3488441Z             x1 = x1.contiguous()
2025-05-07T20:32:07.3488700Z
2025-05-07T20:32:07.3488904Z         if scale_ub is not None:
2025-05-07T20:32:07.3489196Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:07.3489560Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:07.3489882Z             )
2025-05-07T20:32:07.3490094Z         else:
2025-05-07T20:32:07.3490323Z             scale_ub_tensor = None
2025-05-07T20:32:07.3490588Z
2025-05-07T20:32:07.3490839Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:07.3491172Z             op = silu_mul_quant
2025-05-07T20:32:07.3491432Z             if compiled:
2025-05-07T20:32:07.3491701Z                 op = torch.compile(op)
2025-05-07T20:32:07.3492019Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:07.3492316Z
2025-05-07T20:32:07.3492518Z         y_fp8, y_scale = fn()
2025-05-07T20:32:07.3492915Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:07.3493227Z
2025-05-07T20:32:07.3493477Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:07.3493832Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:07.3494150Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:07.3494479Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:07.3494861Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:07.3495192Z
2025-05-07T20:32:07.3495405Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:07.3495618Z
2025-05-07T20:32:07.3495726Z moe/activation_test.py:126:
2025-05-07T20:32:07.3496046Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.3496401Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:07.3496747Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:07.3497578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:07.3498369Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:07.3498941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:07.3499794Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:07.3500518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:07.3501280Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:07.3502068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:07.3502858Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:07.3503628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:07.3504301Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:07.3504930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:07.3505480Z     fn()
2025-05-07T20:32:07.3506016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:07.3506621Z     self.fn.run(
2025-05-07T20:32:07.3507114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:07.3507674Z     kernel = self.compile(
2025-05-07T20:32:07.3508242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:07.3508925Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:07.3509350Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.3509590Z
2025-05-07T20:32:07.3509812Z self =
2025-05-07T20:32:07.3510939Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:07.3512369Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76a536f1c0>}
2025-05-07T20:32:07.3513831Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:07.3514907Z context =
2025-05-07T20:32:07.3515292Z
2025-05-07T20:32:07.3515479Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:07.3516032Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:07.3516540Z                            module_map=module_map)
2025-05-07T20:32:07.3516926Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:07.3517302Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:07.3517581Z E       ^
2025-05-07T20:32:07.3518073Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:07.3518544Z
2025-05-07T20:32:07.3518998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:07.3519534Z
2025-05-07T20:32:07.3519650Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:07.3520092Z     self=,
2025-05-07T20:32:07.3520521Z     T=16384,
2025-05-07T20:32:07.3520734Z     D=7168,
2025-05-07T20:32:07.3520939Z     scale_ub=1200.0,
2025-05-07T20:32:07.3521187Z     contiguous=False,
2025-05-07T20:32:07.3521519Z     compiled=False,
2025-05-07T20:32:07.3521732Z )
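Every failure above has the same root cause: Triton's NVIDIA backend refuses to lower the fp8e4nv element type (the Triton type backing torch.float8_e4m3fn) on this runner's GPU, offering only fp8e4b15 and fp8e5. A minimal sketch of a capability guard such a suite could use, assuming the sm_89 (Ada/Hopper) cutoff implied by the error message; the helper and class names are hypothetical, not FBGEMM code:

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # get_device_capability() returns e.g. (8, 6) for pre-Ada parts;
        # the (8, 9) minimum for fp8e4nv is an assumption from the error above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipIf(not supports_fp8e4nv(), "Triton fp8e4nv requires sm_89 or newer")
    class Fp8ActivationTests(unittest.TestCase):  # hypothetical container
        ...

With such a guard, these cases would report as skipped on older GPUs instead of failing repeatedly at Triton compile time.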
2025-05-07T20:32:09.7728932Z self =
2025-05-07T20:32:09.7729558Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:09.7730425Z
2025-05-07T20:32:09.7730548Z     @given(
2025-05-07T20:32:09.7730886Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:09.7731325Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:09.7731760Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:09.7732186Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:09.7732539Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:09.7732857Z     )
2025-05-07T20:32:09.7733243Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:09.7733728Z     def test_silu_mul_quant(
2025-05-07T20:32:09.7733995Z         self,
2025-05-07T20:32:09.7734212Z         T: int,
2025-05-07T20:32:09.7734446Z         D: int,
2025-05-07T20:32:09.7734683Z         scale_ub: Optional[float],
2025-05-07T20:32:09.7734984Z         contiguous: bool,
2025-05-07T20:32:09.7735250Z         compiled: bool,
2025-05-07T20:32:09.7735506Z     ) -> None:
2025-05-07T20:32:09.7735747Z         torch.manual_seed(2025)
2025-05-07T20:32:09.7736017Z
2025-05-07T20:32:09.7736313Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:09.7736688Z
2025-05-07T20:32:09.7736903Z         x_sign = torch.sign(x)
2025-05-07T20:32:09.7737214Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:09.7737556Z         x = x_sign * x_clamp
2025-05-07T20:32:09.7737823Z         x0 = x[:, :D]
2025-05-07T20:32:09.7738062Z         x1 = x[:, D:]
2025-05-07T20:32:09.7738296Z
2025-05-07T20:32:09.7738501Z         if contiguous:
2025-05-07T20:32:09.7738762Z             x0 = x0.contiguous()
2025-05-07T20:32:09.7739043Z             x1 = x1.contiguous()
2025-05-07T20:32:09.7739316Z
2025-05-07T20:32:09.7739529Z         if scale_ub is not None:
2025-05-07T20:32:09.7739823Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:09.7740188Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:09.7740533Z             )
2025-05-07T20:32:09.7740742Z         else:
2025-05-07T20:32:09.7740977Z             scale_ub_tensor = None
2025-05-07T20:32:09.7741253Z
2025-05-07T20:32:09.7741504Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:09.7741853Z             op = silu_mul_quant
2025-05-07T20:32:09.7742129Z             if compiled:
2025-05-07T20:32:09.7742395Z                 op = torch.compile(op)
2025-05-07T20:32:09.7742718Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:09.7743023Z
2025-05-07T20:32:09.7743233Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:09.7743419Z
2025-05-07T20:32:09.7743530Z moe/activation_test.py:117:
2025-05-07T20:32:09.7744020Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:09.7744386Z moe/activation_test.py:115: in fn
2025-05-07T20:32:09.7744690Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:09.7745441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:09.7746197Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:09.7746774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:09.7747516Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:09.7748232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:09.7748810Z     kernel = self.compile(
2025-05-07T20:32:09.7749391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:09.7750106Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:09.7750539Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:09.7750785Z
2025-05-07T20:32:09.7751102Z self =
2025-05-07T20:32:09.7752256Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:09.7756071Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76a4cb7ac0>}
2025-05-07T20:32:09.7757506Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:09.7758592Z context =
2025-05-07T20:32:09.7758901Z
2025-05-07T20:32:09.7759090Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:09.7759651Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:09.7760157Z                            module_map=module_map)
2025-05-07T20:32:09.7760556Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.7760938Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:09.7761215Z E       ^
2025-05-07T20:32:09.7761716Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.7762195Z
2025-05-07T20:32:09.7762644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:09.7763188Z
2025-05-07T20:32:09.7763310Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:09.7763747Z     self=,
2025-05-07T20:32:09.7764177Z     T=1,
2025-05-07T20:32:09.7764380Z     D=7168,
2025-05-07T20:32:09.7764592Z     scale_ub=None,
2025-05-07T20:32:09.7764824Z     contiguous=True,
2025-05-07T20:32:09.7765067Z     compiled=True,
2025-05-07T20:32:09.7765289Z )
2025-05-07T20:32:09.7765633Z self =
2025-05-07T20:32:09.7766153Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:09.7766430Z
2025-05-07T20:32:09.7766514Z     @given(
2025-05-07T20:32:09.7766767Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:09.7767130Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:09.7767454Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:09.7767813Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:09.7768277Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:09.7768585Z     )
2025-05-07T20:32:09.7768967Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:09.7769445Z     def test_silu_mul_quant(
2025-05-07T20:32:09.7769715Z         self,
2025-05-07T20:32:09.7769924Z         T: int,
2025-05-07T20:32:09.7770144Z         D: int,
2025-05-07T20:32:09.7770384Z         scale_ub: Optional[float],
2025-05-07T20:32:09.7770673Z         contiguous: bool,
2025-05-07T20:32:09.7770937Z         compiled: bool,
2025-05-07T20:32:09.7771184Z     ) -> None:
2025-05-07T20:32:09.7771413Z         torch.manual_seed(2025)
2025-05-07T20:32:09.7771679Z
2025-05-07T20:32:09.7771978Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:09.7772340Z
2025-05-07T20:32:09.7772554Z         x_sign = torch.sign(x)
2025-05-07T20:32:09.7772870Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:09.7773204Z         x = x_sign * x_clamp
2025-05-07T20:32:09.7773489Z         x0 = x[:, :D]
2025-05-07T20:32:09.7773755Z         x1 = x[:, D:]
2025-05-07T20:32:09.7774002Z
2025-05-07T20:32:09.7774206Z         if contiguous:
2025-05-07T20:32:09.7774463Z             x0 = x0.contiguous()
2025-05-07T20:32:09.7774829Z             x1 = x1.contiguous()
2025-05-07T20:32:09.7775094Z
2025-05-07T20:32:09.7775309Z         if scale_ub is not None:
2025-05-07T20:32:09.7775607Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:09.7775968Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:09.7776308Z             )
2025-05-07T20:32:09.7776520Z         else:
2025-05-07T20:32:09.7776747Z             scale_ub_tensor = None
2025-05-07T20:32:09.7777026Z
2025-05-07T20:32:09.7777281Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:09.7777622Z             op = silu_mul_quant
2025-05-07T20:32:09.7777896Z             if compiled:
2025-05-07T20:32:09.7778172Z                 op = torch.compile(op)
2025-05-07T20:32:09.7778491Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:09.7778822Z
2025-05-07T20:32:09.7779037Z         y_fp8, y_scale = fn()
2025-05-07T20:32:09.7779354Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:09.7779675Z
2025-05-07T20:32:09.7779937Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:09.7780302Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:09.7780616Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:09.7780956Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:09.7781349Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:09.7781683Z
2025-05-07T20:32:09.7781909Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:09.7782132Z
2025-05-07T20:32:09.7782242Z moe/activation_test.py:126:
2025-05-07T20:32:09.7782567Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:09.7782930Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:09.7783294Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:09.7784143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:09.7784955Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:09.7785547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:09.7786286Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:09.7787029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:09.7787804Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:09.7788709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:09.7789518Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:09.7790307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:09.7791006Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:09.7791660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:09.7792222Z     fn()
2025-05-07T20:32:09.7792765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:09.7793395Z     self.fn.run(
2025-05-07T20:32:09.7794027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:09.7794604Z     kernel = self.compile(
2025-05-07T20:32:09.7795188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:09.7795895Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:09.7796326Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:09.7796694Z
2025-05-07T20:32:09.7796922Z self =
2025-05-07T20:32:09.7798075Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:09.7799550Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76943d30a0>}
2025-05-07T20:32:09.7800990Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:09.7802088Z context =
2025-05-07T20:32:09.7802403Z
2025-05-07T20:32:09.7802588Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:09.7803202Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:09.7803710Z                            module_map=module_map)
2025-05-07T20:32:09.7804108Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.7804490Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:09.7804779Z E       ^
2025-05-07T20:32:09.7805282Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.7805765Z
2025-05-07T20:32:09.7806214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:09.7806772Z
2025-05-07T20:32:09.7806886Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:09.7807333Z     self=,
2025-05-07T20:32:09.7807780Z     T=4096,
2025-05-07T20:32:09.7807982Z     D=5120,
2025-05-07T20:32:09.7808196Z     scale_ub=None,
2025-05-07T20:32:09.7808437Z     contiguous=False,
2025-05-07T20:32:09.7808679Z     compiled=False,
2025-05-07T20:32:09.7808903Z )
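The reference path dies the same way: triton_quantize_fp8_row autotunes _kernel_quantize_fp8_row, which also materializes fp8e4nv values, so both the kernel under test and its reference fail before any numerics run. What the reference computes is ordinary row-wise fp8 quantization; a rough pure-PyTorch sketch under stated assumptions (the function name and the 448.0 E4M3 maximum are assumptions, not FBGEMM's implementation):

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite torch.float8_e4m3fn value


    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute maximum, optionally capped by scale_ub as in the test.
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        scale = FP8_E4M3_MAX / torch.clamp(row_max, min=1e-12)  # avoid div by zero
        y_fp8 = (y.to(torch.float32) * scale[:, None]).to(torch.float8_e4m3fn)
        # The test dequantizes as y_fp8.to(torch.float32) * y_scale[:, None],
        # so return the inverse scale.
        return y_fp8, scale.reciprocal()

Note that this sketch still produces float8_e4m3fn tensors, so it, too, only runs on hardware whose backend accepts that dtype.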
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.8232198Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:11.8233407Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:32:11.8234985Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:11.8236591Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:11.8238159Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:11.8239721Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.8248310Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:11.8250079Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.8251699Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:11.8253112Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:32:11.8254498Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:11.8255877Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:32:11.8257061Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:11.8258221Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:32:11.8259730Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:11.8261181Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:11.8262449Z W0507 
20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:11.8263645Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:32:11.8265026Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:11.8266562Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:11.8267763Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.8268797Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.8269650Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:32:11.8270806Z W0507 20:32:11.819000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.4241847Z self = 2025-05-07T20:32:15.4242524Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:15.4242829Z 2025-05-07T20:32:15.4242920Z @given( 2025-05-07T20:32:15.4243190Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.4243532Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.4243870Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.4244231Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.4244581Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.4244899Z ) 2025-05-07T20:32:15.4245650Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.4246132Z def test_silu_mul_quant( 2025-05-07T20:32:15.4246392Z self, 2025-05-07T20:32:15.4246613Z T: int, 2025-05-07T20:32:15.4246843Z D: int, 2025-05-07T20:32:15.4247077Z scale_ub: Optional[float], 2025-05-07T20:32:15.4247376Z contiguous: bool, 2025-05-07T20:32:15.4247643Z compiled: bool, 2025-05-07T20:32:15.4247888Z ) -> None: 2025-05-07T20:32:15.4248129Z torch.manual_seed(2025) 2025-05-07T20:32:15.4248398Z 2025-05-07T20:32:15.4248691Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.4249068Z 2025-05-07T20:32:15.4249281Z x_sign = torch.sign(x) 2025-05-07T20:32:15.4249592Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.4249927Z x = x_sign * x_clamp 2025-05-07T20:32:15.4250189Z x0 = x[:, :D] 2025-05-07T20:32:15.4250416Z x1 = x[:, D:] 2025-05-07T20:32:15.4250648Z 2025-05-07T20:32:15.4250851Z if contiguous: 2025-05-07T20:32:15.4251096Z x0 = x0.contiguous() 2025-05-07T20:32:15.4251374Z x1 = x1.contiguous() 2025-05-07T20:32:15.4251807Z 2025-05-07T20:32:15.4252009Z if scale_ub is not None: 2025-05-07T20:32:15.4252303Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.4252663Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.4252986Z ) 2025-05-07T20:32:15.4253197Z else: 2025-05-07T20:32:15.4253425Z scale_ub_tensor = None 
2025-05-07T20:32:15.4253690Z 
2025-05-07T20:32:15.4253941Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:15.4254280Z             op = silu_mul_quant
2025-05-07T20:32:15.4254545Z             if compiled:
2025-05-07T20:32:15.4254812Z                 op = torch.compile(op)
2025-05-07T20:32:15.4255132Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:15.4255426Z 
2025-05-07T20:32:15.4255632Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:15.4255814Z 
2025-05-07T20:32:15.4255927Z moe/activation_test.py:117: 
2025-05-07T20:32:15.4256245Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:15.4256597Z moe/activation_test.py:115: in fn
2025-05-07T20:32:15.4256899Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:15.4257634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:15.4258360Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:15.4258931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:15.4259657Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:15.4260358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:15.4260920Z     kernel = self.compile(
2025-05-07T20:32:15.4261495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:15.4262192Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:15.4262621Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:15.4262862Z 
2025-05-07T20:32:15.4263080Z self = 
2025-05-07T20:32:15.4264226Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:15.4265712Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76a7370c10>}
2025-05-07T20:32:15.4267200Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:15.4268272Z context = 
2025-05-07T20:32:15.4268580Z 
2025-05-07T20:32:15.4268754Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:15.4269298Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:15.4269787Z             module_map=module_map)
2025-05-07T20:32:15.4270166Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:15.4270536Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:15.4270811Z E       ^
2025-05-07T20:32:15.4271291Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:15.4271775Z 
2025-05-07T20:32:15.4272210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:15.4272752Z 
2025-05-07T20:32:15.4272861Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:15.4273379Z     self=,
2025-05-07T20:32:15.4273873Z     T=4096,
2025-05-07T20:32:15.4274073Z     D=7168,
2025-05-07T20:32:15.4274282Z     scale_ub=None,
2025-05-07T20:32:15.4274508Z     contiguous=False,
2025-05-07T20:32:15.4274749Z     compiled=False,
2025-05-07T20:32:15.4274973Z )
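Every example hypothesis draws for test_silu_mul_quant dies at the same point: Triton cannot lower the fp8e4nv (e4m3) cast on this runner. fp8e4nv needs an NVIDIA GPU of compute capability 8.9 or newer (Ada/Hopper); the linux.g5.4xlarge runner carries an A10G at 8.6, which only offers fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal guard sketch, assuming a unittest-style suite; supports_fp8e4nv is an illustrative helper, not fbgemm_gpu API:

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # Triton emits fp8e4nv (e4m3) conversions only on SM 8.9+ (Ada/Hopper);
        # the A10G on this runner reports (8, 6), so this returns False here.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class SiluMulQuantGuardedTests(unittest.TestCase):
        def test_capability_guard(self) -> None:
            self.assertTrue(supports_fp8e4nv())

Skipping at class level would also keep hypothesis from repeatedly drawing examples that can only fail on pre-Ada parts.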
2025-05-07T20:32:15.4302642Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:15.4303008Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:15.4303284Z E       ^
2025-05-07T20:32:15.4303773Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:15.4304252Z 
2025-05-07T20:32:15.4304739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:15.4305279Z 
2025-05-07T20:32:15.4305389Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:15.4305824Z     self=,
2025-05-07T20:32:15.4306245Z     T=128,
2025-05-07T20:32:15.4306438Z     D=7168,
2025-05-07T20:32:15.4306647Z     scale_ub=None,
2025-05-07T20:32:15.4306876Z     contiguous=False,
2025-05-07T20:32:15.4307111Z     compiled=True,
2025-05-07T20:32:15.4307327Z )
2025-05-07T20:32:15.4988161Z self = 
2025-05-07T20:32:15.4989274Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:15.4989838Z 
2025-05-07T20:32:15.5004237Z         y_fp8, y_scale = fn()
2025-05-07T20:32:15.5004587Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:15.5004892Z 
2025-05-07T20:32:15.5005144Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:15.5005501Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:15.5005807Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:15.5006138Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:15.5006514Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:15.5006845Z 
2025-05-07T20:32:15.5007069Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:15.5007274Z 
2025-05-07T20:32:15.5007385Z moe/activation_test.py:126: 
2025-05-07T20:32:15.5007694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:15.5008049Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:15.5008396Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:15.5009222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:15.5010003Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:15.5010666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:15.5011385Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:15.5012109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:15.5012867Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:15.5013656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:15.5014442Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:15.5015231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:15.5015893Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:15.5016524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:15.5017064Z     fn()
2025-05-07T20:32:15.5017584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:15.5018272Z     self.fn.run(
2025-05-07T20:32:15.5018760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:15.5019313Z     kernel = self.compile(
2025-05-07T20:32:15.5019870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:15.5020551Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:15.5020963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:15.5021201Z 
2025-05-07T20:32:15.5021414Z self = 
2025-05-07T20:32:15.5022540Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:15.5024361Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76a5ea7370>}
2025-05-07T20:32:15.5026130Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:15.5027412Z context = 
2025-05-07T20:32:15.5027756Z 
2025-05-07T20:32:15.5027944Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:15.5028574Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:15.5029134Z             module_map=module_map)
2025-05-07T20:32:15.5029556Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:15.5029964Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:15.5030276Z E       ^
2025-05-07T20:32:15.5030834Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:15.5031393Z 
2025-05-07T20:32:15.5031904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:15.5032548Z 
2025-05-07T20:32:15.5032660Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:15.5033142Z     self=,
2025-05-07T20:32:15.5033666Z     T=128,
2025-05-07T20:32:15.5033863Z     D=7168,
2025-05-07T20:32:15.5034073Z     scale_ub=None,
2025-05-07T20:32:15.5034310Z     contiguous=False,
2025-05-07T20:32:15.5034722Z     compiled=False,
2025-05-07T20:32:15.5034946Z )
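With compiled=True the draw gets further: fn() returns, and the failure moves into the eager reference path, where triton_quantize_fp8_row launches _kernel_quantize_fp8_row and trips over the same fp8e4nv cast during autotuning. The reference math itself is plain rowwise scaling; a pure-PyTorch stand-in (an illustrative assumption, not the fbgemm_gpu implementation) would look like:

    from typing import Optional, Tuple

    import torch


    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row: map the row's max |value| onto the e4m3
        # maximum (448.0), optionally capping the row max at scale_ub.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / fp8_max
        y_fp8 = (y.to(torch.float32) / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)

Dequantizing with y_fp8.to(torch.float32) * y_scale[:, None], as the test does, then recovers the silu(x0) * x1 product up to fp8 rounding.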
2025-05-07T20:32:15.7216632Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:15.7217004Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:15.7217276Z E       ^
2025-05-07T20:32:15.7217760Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:15.7218232Z 
2025-05-07T20:32:15.7218660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:15.7219193Z 
2025-05-07T20:32:15.7219316Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:15.7219777Z     self=,
2025-05-07T20:32:15.7220193Z     T=4096,
2025-05-07T20:32:15.7220395Z     D=5120,
2025-05-07T20:32:15.7220610Z     scale_ub=1200.0,
2025-05-07T20:32:15.7220843Z     contiguous=True,
2025-05-07T20:32:15.7221080Z     compiled=False,
2025-05-07T20:32:15.7221299Z )
2025-05-07T20:32:15.7258199Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:15.7258570Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:15.7258849Z E       ^
2025-05-07T20:32:15.7259427Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:15.7259900Z 
2025-05-07T20:32:15.7260332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:15.7260874Z 
2025-05-07T20:32:15.7260985Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:15.7261426Z     self=,
2025-05-07T20:32:15.7261848Z     T=1,
2025-05-07T20:32:15.7262044Z     D=5120,
2025-05-07T20:32:15.7262257Z     scale_ub=None,
2025-05-07T20:32:15.7262490Z     contiguous=True,
2025-05-07T20:32:15.7262724Z     compiled=True,
2025-05-07T20:32:15.7262942Z )
2025-05-07T20:32:16.2129931Z W0507 20:32:16.209000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
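The [1/3] through [1/5] prefixes are torch.compile recompile counters for this frame: each new (T, D) shape retraces fn(), and every retrace re-runs identify_mutated_tensors, which lowers the user-defined Triton kernel to TTIR just to discover which arguments it writes. When that lowering raises (here, on the fp8e4nv cast), Dynamo logs the warning above and conservatively marks every input as mutated. The kernel being analyzed follows the silu-mul pattern of the test; an illustrative Triton kernel of that shape (not the FBGEMM source), with the offending fp8 cast left out:

    import triton
    import triton.language as tl


    @triton.jit
    def silu_mul_kernel(x0_ptr, x1_ptr, y_ptr, n, BLOCK: tl.constexpr):
        # Computes silu(x0) * x1 in fp32. The real kernel additionally casts
        # the result to tl.float8e4nv, which is the cast SM 8.6 rejects.
        pid = tl.program_id(axis=0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x0 = tl.load(x0_ptr + offs, mask=mask).to(tl.float32)
        x1 = tl.load(x1_ptr + offs, mask=mask).to(tl.float32)
        y = x0 * tl.sigmoid(x0) * x1
        tl.store(y_ptr + offs, y, mask=mask)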
2025-05-07T20:32:16.3823735Z W0507 20:32:16.379000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:16.8483523Z W0507 20:32:16.845000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:16.8786497Z W0507 20:32:16.875000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
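Because @settings(verbosity=Verbosity.verbose) logs every draw and nothing fails fast, hypothesis keeps sampling fresh (T, D, scale_ub, contiguous, compiled) tuples that all hit the identical CompilationError, up to _MAX_SAMPLES. To replay one specific draw without waiting on the sampler, an @example pin runs first; a generic, self-contained sketch (not a patch to moe/activation_test.py):

    from hypothesis import example, given, settings
    from hypothesis import strategies as st


    @settings(max_examples=5, deadline=None)
    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    @example(T=1)  # replayed before any random draws
    def test_replay_one_case(T: int) -> None:
        assert T >= 1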
2025-05-07T20:32:17.1991395Z 2025-05-07T20:32:17.1991651Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.1991994Z op = silu_mul_quant 2025-05-07T20:32:17.1992271Z if compiled: 2025-05-07T20:32:17.1992545Z op = torch.compile(op) 2025-05-07T20:32:17.1992863Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.1993173Z 2025-05-07T20:32:17.1993388Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.1993776Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.1994099Z 2025-05-07T20:32:17.1994365Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.1994722Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.1995080Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.1995437Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.1995819Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.1996160Z 2025-05-07T20:32:17.1996386Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.1996597Z 2025-05-07T20:32:17.1996715Z moe/activation_test.py:126: 2025-05-07T20:32:17.1997185Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.1997555Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.1997914Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.1998765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.1999575Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.2000168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.2000911Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.2001648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.2002435Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.2003251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.2004058Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.2004920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.2005610Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.2006257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.2006812Z fn() 2025-05-07T20:32:17.2007362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.2007988Z self.fn.run( 2025-05-07T20:32:17.2008494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.2009068Z kernel = self.compile( 2025-05-07T20:32:17.2009653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.2010364Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.2010788Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.2011043Z 2025-05-07T20:32:17.2011264Z self = 2025-05-07T20:32:17.2012422Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.2013891Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f767f26d480>} 2025-05-07T20:32:17.2015382Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.2016477Z context = 2025-05-07T20:32:17.2016794Z 2025-05-07T20:32:17.2016975Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.2017535Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.2018043Z module_map=module_map) 2025-05-07T20:32:17.2018434Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.2018821Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.2019115Z E ^ 2025-05-07T20:32:17.2019611Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.2020187Z 2025-05-07T20:32:17.2020635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.2021187Z 2025-05-07T20:32:17.2021306Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.2021756Z self=, 2025-05-07T20:32:17.2022195Z T=2048, 2025-05-07T20:32:17.2022410Z D=5120, 2025-05-07T20:32:17.2022631Z scale_ub=None, 2025-05-07T20:32:17.2022874Z contiguous=True, 2025-05-07T20:32:17.2023133Z compiled=True, 2025-05-07T20:32:17.2023372Z ) 2025-05-07T20:32:17.6695853Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:17.6696988Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:17.6698416Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:17.6700099Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:17.6701541Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:17.6702988Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.6704352Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:17.6705788Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.6707259Z W0507 20:32:17.666000 86874 
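[editor note] Every failure in this run has the same root cause: Triton's fp8e4nv type (PyTorch's torch.float8_e4m3fn) needs hardware fp8 support, which NVIDIA GPUs only provide from compute capability 8.9 (Ada/Hopper) onward; the GPU on this runner evidently predates that, so only fp8e4b15 and fp8e5 are available, exactly as the ValueError states. A minimal sketch of a capability guard, assuming pytest and a hypothetical supports_fp8e4nv helper not present in the test file, would let the suite skip rather than error on such machines:

import pytest
import torch

def supports_fp8e4nv() -> bool:
    # Hypothetical helper: Triton's fp8e4nv (torch.float8_e4m3fn) appears to
    # require an NVIDIA GPU of compute capability >= 8.9 (Ada/Hopper); older
    # parts only expose fp8e4b15 and fp8e5, matching the ValueError above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Applied to the failing test, this would skip instead of erroring:
requires_fp8e4nv = pytest.mark.skipif(
    not supports_fp8e4nv(), reason="GPU lacks fp8e4nv (compute capability < 8.9)"
)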
2025-05-07T20:32:17.2021187Z
2025-05-07T20:32:17.2021306Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:17.2021756Z     self=,
2025-05-07T20:32:17.2022195Z     T=2048,
2025-05-07T20:32:17.2022410Z     D=5120,
2025-05-07T20:32:17.2022631Z     scale_ub=None,
2025-05-07T20:32:17.2022874Z     contiguous=True,
2025-05-07T20:32:17.2023133Z     compiled=True,
2025-05-07T20:32:17.2023372Z )
2025-05-07T20:32:17.6695853Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:17.6696988Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last):
2025-05-07T20:32:17.6698416Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:32:17.6700099Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:32:17.6701541Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:32:17.6702988Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:17.6704352Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:32:17.6705788Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:17.6707259Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:32:17.6708552Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     generator.visit(fn.parse())
2025-05-07T20:32:17.6709822Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:32:17.6711075Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ret = super().visit(node)
2025-05-07T20:32:17.6712159Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit
2025-05-07T20:32:17.6713220Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     return visitor(node)
2025-05-07T20:32:17.6714536Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:32:17.6716003Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:32:17.6717157Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit
2025-05-07T20:32:17.6718237Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     self.visit(item)
2025-05-07T20:32:17.6719445Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:32:17.6720850Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:32:17.6721952Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.6722897Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.6723897Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^
2025-05-07T20:32:17.6724954Z W0507 20:32:17.666000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[... the identical [1/5] identify_mutated_tensors warning and CompilationError traceback is emitted three more times, at 20:32:17.834, 20:32:18.323, and 20:32:18.356; only the timestamps differ ...]
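[editor note] The stack trace pinpoints the failing call: triton_quantize_fp8_row in fbgemm_gpu's fp8_gemm.py launches _kernel_quantize_fp8_row, whose compilation dies at the kernel signature. The failure should therefore reproduce without hypothesis or torch.compile by calling the quantizer directly; a sketch, assuming only the import path and two-argument call shown in the trace above:

import torch
from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

# A row-major float32 activation in the same shape family the test draws;
# the values are irrelevant, since compilation fails before the kernel runs.
y = torch.randn(128, 5120, device="cuda", dtype=torch.float32)

# On a GPU without fp8e4nv support, this is expected to raise
# triton.compiler.errors.CompilationError wrapping the ValueError above.
y_fp8, y_scale = triton_quantize_fp8_row(y, None)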
2025-05-07T20:32:18.8471590Z self =
2025-05-07T20:32:18.8472183Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test source, Triton stack trace, and locals identical to the T = 1 failure above ...]
2025-05-07T20:32:18.8512216Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:18.8512621Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:18.8512919Z E       ^
2025-05-07T20:32:18.8513444Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:18.8514596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:18.8515299Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:18.8515811Z     self=,
2025-05-07T20:32:18.8516268Z     T=128,
2025-05-07T20:32:18.8516488Z     D=5120,
2025-05-07T20:32:18.8516707Z     scale_ub=None,
2025-05-07T20:32:18.8516954Z     contiguous=True,
2025-05-07T20:32:18.8517214Z     compiled=True,
2025-05-07T20:32:18.8517446Z )
[... the [1/6] identify_mutated_tensors warning and CompilationError traceback is emitted four times for this example, at 20:32:19.375, 20:32:19.561, 20:32:20.071, and 20:32:20.104; content identical to the [1/5] traceback above ...]
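[editor note] Skipping the test handles CI, but an alternative is a kernel-side fallback that picks a supported fp8 format at runtime. A minimal sketch of that dispatch, not FBGEMM's actual code; the sm_89 cutoff for e4m3 is our assumption from the error message and NVIDIA's fp8 hardware generations:

import torch

def pick_fp8_dtype() -> torch.dtype:
    # Assumption: e4m3 (Triton "fp8e4nv") needs compute capability >= 8.9,
    # while e5m2 (Triton "fp8e5") is one of the formats the error message
    # reports as still available on this runner's GPU.
    if torch.cuda.get_device_capability() >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2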
20:32:20.104000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:20.1107633Z W0507 20:32:20.104000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:20.1108952Z W0507 20:32:20.104000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:20.1110474Z W0507 20:32:20.104000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:20.1111662Z W0507 20:32:20.104000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:20.1112676Z W0507 20:32:20.104000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:20.1113707Z W0507 20:32:20.104000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:20.1114853Z W0507 20:32:20.104000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:20.5423751Z self = 2025-05-07T20:32:20.5425461Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:20.5426065Z 2025-05-07T20:32:20.5426185Z @given( 2025-05-07T20:32:20.5426482Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:20.5426836Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:20.5427189Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:20.5427571Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:20.5427945Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:20.5428280Z ) 2025-05-07T20:32:20.5428701Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:20.5429202Z def test_silu_mul_quant( 2025-05-07T20:32:20.5429490Z self, 2025-05-07T20:32:20.5429723Z T: int, 2025-05-07T20:32:20.5430575Z D: int, 2025-05-07T20:32:20.5430831Z scale_ub: Optional[float], 2025-05-07T20:32:20.5431155Z contiguous: bool, 2025-05-07T20:32:20.5431438Z compiled: bool, 2025-05-07T20:32:20.5431697Z ) -> None: 2025-05-07T20:32:20.5431948Z torch.manual_seed(2025) 2025-05-07T20:32:20.5432239Z 2025-05-07T20:32:20.5432548Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:20.5432942Z 2025-05-07T20:32:20.5433168Z x_sign = torch.sign(x) 2025-05-07T20:32:20.5433580Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:20.5433937Z x = x_sign * x_clamp 2025-05-07T20:32:20.5434219Z x0 = x[:, :D] 2025-05-07T20:32:20.5434463Z x1 = x[:, D:] 2025-05-07T20:32:20.5434732Z 2025-05-07T20:32:20.5434949Z if contiguous: 2025-05-07T20:32:20.5435220Z x0 = x0.contiguous() 2025-05-07T20:32:20.5435509Z x1 = x1.contiguous() 2025-05-07T20:32:20.5435794Z 2025-05-07T20:32:20.5436019Z if scale_ub is not None: 2025-05-07T20:32:20.5436327Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:20.5436712Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:20.5437066Z ) 2025-05-07T20:32:20.5437283Z else: 2025-05-07T20:32:20.5437527Z scale_ub_tensor = None 
2025-05-07T20:32:20.5437819Z 2025-05-07T20:32:20.5438088Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:20.5438442Z op = silu_mul_quant 2025-05-07T20:32:20.5438730Z if compiled: 2025-05-07T20:32:20.5439018Z op = torch.compile(op) 2025-05-07T20:32:20.5439352Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:20.5439672Z 2025-05-07T20:32:20.5439902Z y_fp8, y_scale = fn() 2025-05-07T20:32:20.5440225Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:20.5440560Z 2025-05-07T20:32:20.5440834Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:20.5441217Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:20.5441556Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:20.5441918Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:20.5442323Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:20.5442680Z 2025-05-07T20:32:20.5442917Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:20.5443139Z 2025-05-07T20:32:20.5443264Z moe/activation_test.py:126: 2025-05-07T20:32:20.5443603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:20.5443992Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:20.5444518Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:20.5445410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:20.5446270Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:20.5446894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:20.5447676Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:20.5448452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:20.5449277Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:20.5450133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:20.5450994Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:20.5451817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:20.5452549Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:20.5453322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:20.5453910Z fn() 2025-05-07T20:32:20.5454493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:20.5455151Z self.fn.run( 2025-05-07T20:32:20.5455686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:20.5456285Z kernel = self.compile( 2025-05-07T20:32:20.5456901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:20.5457651Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:20.5458100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:20.5458363Z 2025-05-07T20:32:20.5458605Z self = 2025-05-07T20:32:20.5459829Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:20.5461378Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f767ca01360>} 2025-05-07T20:32:20.5462900Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:20.5464048Z context = 2025-05-07T20:32:20.5464379Z 2025-05-07T20:32:20.5464570Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:20.5465170Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:20.5465716Z module_map=module_map) 2025-05-07T20:32:20.5466134Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:20.5466551Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:20.5466862Z E ^ 2025-05-07T20:32:20.5467392Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:20.5467917Z 2025-05-07T20:32:20.5468395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:20.5468979Z 2025-05-07T20:32:20.5469189Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:20.5469667Z self=, 2025-05-07T20:32:20.5470117Z T=4096, 2025-05-07T20:32:20.5470334Z D=5120, 2025-05-07T20:32:20.5470564Z scale_ub=None, 2025-05-07T20:32:20.5470807Z contiguous=True, 2025-05-07T20:32:20.5471065Z compiled=True, 2025-05-07T20:32:20.5471303Z ) 2025-05-07T20:32:21.0805703Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:21.0806916Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:32:21.0808433Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:21.0810031Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:21.0811759Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:21.0813307Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.0814765Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:21.0816312Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.0817901Z W0507 20:32:21.076000 86874 
2025-05-07T20:32:20.5468979Z 
2025-05-07T20:32:20.5469189Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.5469667Z     self=,
2025-05-07T20:32:20.5470117Z     T=4096,
2025-05-07T20:32:20.5470334Z     D=5120,
2025-05-07T20:32:20.5470564Z     scale_ub=None,
2025-05-07T20:32:20.5470807Z     contiguous=True,
2025-05-07T20:32:20.5471065Z     compiled=True,
2025-05-07T20:32:20.5471303Z )
2025-05-07T20:32:21.0805703Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:21.0806916Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last):
2025-05-07T20:32:21.0808433Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:32:21.0810031Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:32:21.0811759Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:32:21.0813307Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:21.0814765Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:32:21.0816312Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:21.0817901Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:32:21.0819306Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]     generator.visit(fn.parse())
2025-05-07T20:32:21.0820669Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:32:21.0822033Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]     ret = super().visit(node)
2025-05-07T20:32:21.0823200Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit
2025-05-07T20:32:21.0824687Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]     return visitor(node)
2025-05-07T20:32:21.0826048Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:32:21.0827531Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:32:21.0828921Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit
2025-05-07T20:32:21.0830087Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]     self.visit(item)
2025-05-07T20:32:21.0831398Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:32:21.0832906Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:32:21.0834202Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:21.0835221Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant(
2025-05-07T20:32:21.0836055Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^
2025-05-07T20:32:21.0837238Z W0507 20:32:21.076000 86874 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.2388449Z self = 
2025-05-07T20:32:22.2389035Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:22.2389331Z 
2025-05-07T20:32:22.2389417Z @given(
2025-05-07T20:32:22.2389671Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:22.2390006Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:22.2390339Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:22.2390702Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:22.2391056Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:22.2391373Z )
2025-05-07T20:32:22.2391756Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:22.2392231Z def test_silu_mul_quant(
2025-05-07T20:32:22.2392496Z     self,
2025-05-07T20:32:22.2392712Z     T: int,
2025-05-07T20:32:22.2392920Z     D: int,
2025-05-07T20:32:22.2393160Z     scale_ub: Optional[float],
2025-05-07T20:32:22.2393724Z     contiguous: bool,
2025-05-07T20:32:22.2393987Z     compiled: bool,
2025-05-07T20:32:22.2394225Z ) -> None:
2025-05-07T20:32:22.2394461Z     torch.manual_seed(2025)
2025-05-07T20:32:22.2394724Z 
2025-05-07T20:32:22.2395014Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:22.2395382Z 
2025-05-07T20:32:22.2395594Z     x_sign = torch.sign(x)
2025-05-07T20:32:22.2395904Z     x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:22.2396240Z     x = x_sign * x_clamp
2025-05-07T20:32:22.2396502Z     x0 = x[:, :D]
2025-05-07T20:32:22.2396733Z     x1 = x[:, D:]
2025-05-07T20:32:22.2396959Z 
2025-05-07T20:32:22.2397160Z     if contiguous:
2025-05-07T20:32:22.2397411Z         x0 = x0.contiguous()
2025-05-07T20:32:22.2397690Z         x1 = x1.contiguous()
2025-05-07T20:32:22.2397950Z 
2025-05-07T20:32:22.2398153Z     if scale_ub is not None:
2025-05-07T20:32:22.2398449Z         scale_ub_tensor = torch.tensor(
2025-05-07T20:32:22.2398822Z             [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:22.2399156Z         )
2025-05-07T20:32:22.2399361Z     else:
2025-05-07T20:32:22.2399589Z         scale_ub_tensor = None
2025-05-07T20:32:22.2399863Z 
2025-05-07T20:32:22.2400110Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:22.2400448Z         op = silu_mul_quant
2025-05-07T20:32:22.2400720Z         if compiled:
2025-05-07T20:32:22.2400987Z             op = torch.compile(op)
2025-05-07T20:32:22.2401309Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:22.2401606Z 
2025-05-07T20:32:22.2401813Z     y_fp8, y_scale = fn()
2025-05-07T20:32:22.2402125Z     y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:22.2402438Z 
2025-05-07T20:32:22.2402687Z     def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:22.2403045Z         x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:22.2403366Z         x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:22.2403704Z         y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:22.2404084Z         return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:22.2404421Z 
2025-05-07T20:32:22.2404642Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:22.2404850Z 
2025-05-07T20:32:22.2404957Z moe/activation_test.py:126: 
2025-05-07T20:32:22.2405282Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:22.2405647Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:22.2405996Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:22.2406963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:22.2407766Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:22.2408353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:22.2409078Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:22.2409813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:22.2410586Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:22.2411390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:22.2412179Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:22.2412961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:22.2413645Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:22.2414282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:22.2414921Z     fn()
2025-05-07T20:32:22.2415465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:22.2416090Z     self.fn.run(
2025-05-07T20:32:22.2416586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:22.2417156Z     kernel = self.compile(
2025-05-07T20:32:22.2417734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:22.2418426Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:22.2418856Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:22.2419103Z 
2025-05-07T20:32:22.2419324Z self = 
2025-05-07T20:32:22.2420467Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:22.2421943Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f767c4130a0>}
2025-05-07T20:32:22.2423353Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:22.2424895Z context = 
2025-05-07T20:32:22.2425215Z 
2025-05-07T20:32:22.2425395Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:22.2425956Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:22.2426467Z                            module_map=module_map)
2025-05-07T20:32:22.2426903Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.2427293Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:22.2427574Z E       ^
2025-05-07T20:32:22.2428067Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.2428546Z 
2025-05-07T20:32:22.2428986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:22.2429523Z 
2025-05-07T20:32:22.2429642Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:22.2430220Z     self=,
2025-05-07T20:32:22.2430658Z     T=16384,
2025-05-07T20:32:22.2430873Z     D=5120,
2025-05-07T20:32:22.2431088Z     scale_ub=None,
2025-05-07T20:32:22.2431319Z     contiguous=True,
2025-05-07T20:32:22.2431564Z     compiled=True,
2025-05-07T20:32:22.2431785Z )
2025-05-07T20:32:22.2861195Z W0507 20:32:22.284000 86874 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:22.2862518Z W0507 20:32:22.284000 86874 site-packages/torch/_dynamo/convert_frame.py:987] [1/8]    function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:22.2863935Z W0507 20:32:22.284000 86874 site-packages/torch/_dynamo/convert_frame.py:987] [1/8]    last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:22.2865001Z W0507 20:32:22.284000 86874 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:22.2866178Z W0507 20:32:22.284000 86874 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
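The stride mismatch follows from the test itself: x0 = x[:, :D] views a [T, 2*D] buffer (row stride 10240 for D=5120), while contiguous=True examples pass a compacted copy (row stride 5120), so each layout change trips a guard and forces a recompile of silu_mul_quant until the budget of 8 is spent and Dynamo falls back to eager. Two hedged workarounds, assuming the config knob keeps the name printed in the warning (older PyTorch releases call it cache_size_limit):

import torch
import torch._dynamo

# Option 1: raise the per-function recompile budget named in the warning.
torch._dynamo.config.recompile_limit = 32

# Option 2: compile once with dynamic shapes so the stride/shape variants
# drawn by Hypothesis can share a single graph.
# op = torch.compile(silu_mul_quant, dynamic=True)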
2025-05-07T20:32:22.3948967Z self = 
2025-05-07T20:32:22.3949979Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:22.3950514Z 
2025-05-07T20:32:22.3968831Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:22.3969164Z moe/activation_test.py:126: 
2025-05-07T20:32:22.3991339Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.3991734Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:22.3992036Z E       ^
2025-05-07T20:32:22.3992544Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.3993559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:22.3994120Z 
2025-05-07T20:32:22.3994249Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:22.3994707Z     self=,
2025-05-07T20:32:22.3995147Z     T=1,
2025-05-07T20:32:22.3995357Z     D=5120,
2025-05-07T20:32:22.3995709Z     scale_ub=1200.0,
2025-05-07T20:32:22.3995959Z     contiguous=True,
2025-05-07T20:32:22.3996211Z     compiled=True,
2025-05-07T20:32:22.3996445Z )
2025-05-07T20:32:22.7581113Z self = 
2025-05-07T20:32:22.7582192Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:22.7582751Z 
2025-05-07T20:32:22.7597137Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:22.7597427Z moe/activation_test.py:117: 
2025-05-07T20:32:22.7612901Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.7613276Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:22.7613558Z E       ^
2025-05-07T20:32:22.7614048Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.7614969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
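Both the fused kernel (_fbgemm_silu_mul_quant) and the reference path (_kernel_quantize_fp8_row) fail the same way, so every drawn example dies before the comparison runs. For context, a plain-PyTorch sketch of what the fn/ref_fn pair computes: SiLU(x0) * x1 followed by row-wise FP8 quantization. The scale_ub clamp and the returned inverse scale are assumptions modeled on the test's dequantization step (y_fp8.to(torch.float32) * y_scale[:, None]), not taken from the fbgemm_gpu source:

from typing import Optional, Tuple
import torch

def silu_mul_quant_sketch(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
    fp8_dtype: torch.dtype = torch.float8_e4m3fn,
) -> Tuple[torch.Tensor, torch.Tensor]:
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()  # SiLU(x0) * x1
    amax = y.abs().amax(dim=1).clamp(min=1e-12)  # per-row max magnitude
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub)  # assumed clamp semantics
    scale = torch.finfo(fp8_dtype).max / amax  # per-row quantization scale
    y_fp8 = (y * scale[:, None]).to(fp8_dtype)
    return y_fp8, scale.reciprocal()  # inverse scale, used for dequantization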
2025-05-07T20:32:22.7600286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.7601004Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.7601580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.7602304Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.7602998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.7603692Z kernel = self.compile( 2025-05-07T20:32:22.7604266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.7604957Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.7605370Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.7605617Z 2025-05-07T20:32:22.7605832Z self = 2025-05-07T20:32:22.7607019Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.7608481Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f767cc3d480>} 2025-05-07T20:32:22.7609890Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.7610966Z context = 2025-05-07T20:32:22.7611278Z 2025-05-07T20:32:22.7611456Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.7612012Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.7612505Z module_map=module_map) 2025-05-07T20:32:22.7612901Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.7613276Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.7613558Z E ^ 2025-05-07T20:32:22.7614048Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.7614533Z 2025-05-07T20:32:22.7614969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.7615505Z 2025-05-07T20:32:22.7615625Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.7616062Z self=, 2025-05-07T20:32:22.7616496Z T=1, 2025-05-07T20:32:22.7616701Z D=5120, 2025-05-07T20:32:22.7616913Z scale_ub=None, 2025-05-07T20:32:22.7617142Z contiguous=False, 2025-05-07T20:32:22.7617386Z compiled=True, 2025-05-07T20:32:22.7617611Z ) 2025-05-07T20:32:22.8318861Z self = 2025-05-07T20:32:22.8319427Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:22.8319728Z 2025-05-07T20:32:22.8319815Z @given( 2025-05-07T20:32:22.8327556Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.8328025Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.8328360Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.8328716Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.8329061Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.8329369Z ) 2025-05-07T20:32:22.8329745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.8330212Z def test_silu_mul_quant( 2025-05-07T20:32:22.8330471Z self, 2025-05-07T20:32:22.8330682Z T: int, 2025-05-07T20:32:22.8330890Z D: int, 2025-05-07T20:32:22.8331121Z scale_ub: Optional[float], 2025-05-07T20:32:22.8331422Z contiguous: bool, 2025-05-07T20:32:22.8331676Z compiled: bool, 2025-05-07T20:32:22.8331920Z ) -> None: 2025-05-07T20:32:22.8332150Z torch.manual_seed(2025) 2025-05-07T20:32:22.8332407Z 2025-05-07T20:32:22.8332867Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.8333231Z 2025-05-07T20:32:22.8333441Z x_sign = torch.sign(x) 2025-05-07T20:32:22.8333747Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.8334075Z x = x_sign * x_clamp 2025-05-07T20:32:22.8334332Z x0 = x[:, :D] 2025-05-07T20:32:22.8334558Z x1 = x[:, D:] 2025-05-07T20:32:22.8334780Z 2025-05-07T20:32:22.8334984Z if contiguous: 2025-05-07T20:32:22.8335222Z x0 = x0.contiguous() 2025-05-07T20:32:22.8335498Z x1 = x1.contiguous() 2025-05-07T20:32:22.8335751Z 2025-05-07T20:32:22.8335952Z if scale_ub is not None: 2025-05-07T20:32:22.8336245Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.8336597Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.8336917Z ) 2025-05-07T20:32:22.8337124Z else: 2025-05-07T20:32:22.8337346Z scale_ub_tensor = None 2025-05-07T20:32:22.8337625Z 2025-05-07T20:32:22.8337869Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.8338205Z op = silu_mul_quant 2025-05-07T20:32:22.8338471Z if compiled: 2025-05-07T20:32:22.8338733Z op = torch.compile(op) 2025-05-07T20:32:22.8339050Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.8339339Z 2025-05-07T20:32:22.8339546Z y_fp8, y_scale = fn() 2025-05-07T20:32:22.8339847Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:22.8340144Z 2025-05-07T20:32:22.8340394Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.8340743Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:22.8341049Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:22.8341379Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:22.8341756Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:22.8342103Z 2025-05-07T20:32:22.8342312Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:22.8342526Z 2025-05-07T20:32:22.8342634Z moe/activation_test.py:126: 2025-05-07T20:32:22.8342950Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.8343300Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:22.8343645Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:22.8344468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:22.8345256Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:22.8345948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.8346696Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.8347443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:22.8348205Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:22.8348994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:22.8349783Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:22.8350547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:22.8351215Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:22.8351850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:22.8352392Z fn() 2025-05-07T20:32:22.8352925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:22.8353707Z self.fn.run( 2025-05-07T20:32:22.8354215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.8354771Z kernel = self.compile( 2025-05-07T20:32:22.8355332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.8356016Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.8356439Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.8356681Z 2025-05-07T20:32:22.8356935Z self = 2025-05-07T20:32:22.8358079Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.8359525Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f7677ea89d0>} 2025-05-07T20:32:22.8360924Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.8361993Z context = 2025-05-07T20:32:22.8362293Z 2025-05-07T20:32:22.8362473Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.8363017Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.8363513Z module_map=module_map) 2025-05-07T20:32:22.8363898Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.8364268Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:22.8364556Z E ^ 2025-05-07T20:32:22.8365042Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.8365518Z 2025-05-07T20:32:22.8365957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.8366491Z 2025-05-07T20:32:22.8366601Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.8367035Z self=, 2025-05-07T20:32:22.8367454Z T=1, 2025-05-07T20:32:22.8367646Z D=5120, 2025-05-07T20:32:22.8367852Z scale_ub=None, 2025-05-07T20:32:22.8368076Z contiguous=True, 2025-05-07T20:32:22.8368312Z compiled=False, 2025-05-07T20:32:22.8368619Z ) 2025-05-07T20:32:23.0055401Z self = 2025-05-07T20:32:23.0056009Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:23.0056397Z 2025-05-07T20:32:23.0056548Z @given( 2025-05-07T20:32:23.0057094Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.0057720Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.0058305Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.0058922Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.0059540Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.0060088Z ) 2025-05-07T20:32:23.0060745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.0061583Z def test_silu_mul_quant( 2025-05-07T20:32:23.0062047Z self, 2025-05-07T20:32:23.0062418Z T: int, 2025-05-07T20:32:23.0062797Z D: int, 2025-05-07T20:32:23.0063216Z scale_ub: Optional[float], 2025-05-07T20:32:23.0063732Z contiguous: bool, 2025-05-07T20:32:23.0064182Z compiled: bool, 2025-05-07T20:32:23.0064610Z ) -> None: 2025-05-07T20:32:23.0065325Z torch.manual_seed(2025) 2025-05-07T20:32:23.0065783Z 2025-05-07T20:32:23.0066305Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.0066924Z 2025-05-07T20:32:23.0067146Z x_sign = torch.sign(x) 2025-05-07T20:32:23.0067454Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.0067778Z x = x_sign * x_clamp 2025-05-07T20:32:23.0068028Z x0 = x[:, :D] 2025-05-07T20:32:23.0068255Z x1 = x[:, D:] 2025-05-07T20:32:23.0068475Z 2025-05-07T20:32:23.0068668Z if contiguous: 2025-05-07T20:32:23.0068916Z x0 = x0.contiguous() 2025-05-07T20:32:23.0069189Z x1 = x1.contiguous() 2025-05-07T20:32:23.0069443Z 2025-05-07T20:32:23.0069646Z if scale_ub is not None: 2025-05-07T20:32:23.0069936Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.0070289Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.0070616Z ) 2025-05-07T20:32:23.0070823Z else: 2025-05-07T20:32:23.0071052Z scale_ub_tensor = None 2025-05-07T20:32:23.0071314Z 2025-05-07T20:32:23.0071558Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.0071889Z op = silu_mul_quant 2025-05-07T20:32:23.0072151Z if compiled: 2025-05-07T20:32:23.0072423Z 
op = torch.compile(op) 2025-05-07T20:32:23.0072738Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.0073030Z 2025-05-07T20:32:23.0073236Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.0073410Z 2025-05-07T20:32:23.0073608Z moe/activation_test.py:117: 2025-05-07T20:32:23.0073925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.0074272Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.0074571Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.0075297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.0076018Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.0076584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.0077302Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.0078000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.0078553Z kernel = self.compile( 2025-05-07T20:32:23.0079120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.0079931Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.0080344Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.0080586Z 2025-05-07T20:32:23.0080801Z self = 2025-05-07T20:32:23.0081929Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.0083355Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f767cc3d1b0>} 2025-05-07T20:32:23.0084752Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.0085818Z context = 2025-05-07T20:32:23.0086125Z 2025-05-07T20:32:23.0086304Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.0086930Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.0087420Z module_map=module_map) 2025-05-07T20:32:23.0087800Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.0088172Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.0088444Z E ^ 2025-05-07T20:32:23.0088924Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.0089399Z 2025-05-07T20:32:23.0089830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.0090368Z 2025-05-07T20:32:23.0090482Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.0090927Z self=, 2025-05-07T20:32:23.0091343Z T=128, 2025-05-07T20:32:23.0091541Z D=5120, 2025-05-07T20:32:23.0091746Z scale_ub=None, 2025-05-07T20:32:23.0091979Z contiguous=False, 2025-05-07T20:32:23.0092220Z compiled=True, 2025-05-07T20:32:23.0092435Z ) 2025-05-07T20:32:23.0092766Z self = 2025-05-07T20:32:23.0093282Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:23.0093563Z 2025-05-07T20:32:23.0093648Z @given( 2025-05-07T20:32:23.0093887Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.0094215Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.0094540Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.0094891Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.0095248Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.0095563Z ) 2025-05-07T20:32:23.0095933Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.0096394Z def test_silu_mul_quant( 2025-05-07T20:32:23.0096686Z self, 2025-05-07T20:32:23.0096901Z T: int, 2025-05-07T20:32:23.0097116Z D: int, 2025-05-07T20:32:23.0097344Z scale_ub: Optional[float], 2025-05-07T20:32:23.0097639Z contiguous: bool, 2025-05-07T20:32:23.0097898Z compiled: bool, 2025-05-07T20:32:23.0098131Z ) -> None: 2025-05-07T20:32:23.0098373Z torch.manual_seed(2025) 2025-05-07T20:32:23.0098635Z 2025-05-07T20:32:23.0098919Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.0099280Z 2025-05-07T20:32:23.0099490Z x_sign = torch.sign(x) 2025-05-07T20:32:23.0099792Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.0100120Z x = x_sign * x_clamp 2025-05-07T20:32:23.0100489Z x0 = x[:, :D] 2025-05-07T20:32:23.0100725Z x1 = x[:, D:] 2025-05-07T20:32:23.0100942Z 2025-05-07T20:32:23.0101142Z if contiguous: 2025-05-07T20:32:23.0101388Z x0 = x0.contiguous() 2025-05-07T20:32:23.0101668Z x1 = x1.contiguous() 2025-05-07T20:32:23.0101922Z 2025-05-07T20:32:23.0102128Z if scale_ub is not None: 2025-05-07T20:32:23.0102413Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.0102767Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.0103098Z ) 2025-05-07T20:32:23.0103300Z else: 2025-05-07T20:32:23.0103529Z scale_ub_tensor = None 2025-05-07T20:32:23.0103798Z 2025-05-07T20:32:23.0104039Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.0104373Z op = silu_mul_quant 2025-05-07T20:32:23.0104639Z if compiled: 2025-05-07T20:32:23.0104898Z op = torch.compile(op) 2025-05-07T20:32:23.0105223Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.0105515Z 2025-05-07T20:32:23.0105721Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.0105896Z 2025-05-07T20:32:23.0106001Z moe/activation_test.py:117: 2025-05-07T20:32:23.0106406Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.0106754Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.0107051Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.0107644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.0108234Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.0108920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.0109642Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.0110211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.0110930Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.0111618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.0112186Z kernel = self.compile( 2025-05-07T20:32:23.0112760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.0113452Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.0113926Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.0114174Z 2025-05-07T20:32:23.0114391Z self = 2025-05-07T20:32:23.0115520Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.0116996Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f7677e27910>} 2025-05-07T20:32:23.0118399Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.0119470Z context = 2025-05-07T20:32:23.0119777Z 2025-05-07T20:32:23.0119957Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.0120510Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.0121003Z module_map=module_map) 2025-05-07T20:32:23.0121482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.0121859Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.0122130Z E ^ 2025-05-07T20:32:23.0122621Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:23.0123101Z
2025-05-07T20:32:23.0123534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:23.0124396Z
2025-05-07T20:32:23.0124518Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:23.1446774Z E   triton.compiler.errors.CompilationError: _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:23.1448348Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:23.1486492Z E   triton.compiler.errors.CompilationError: _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:23.1488050Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:23.3497063Z E   triton.compiler.errors.CompilationError: _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:23.3498643Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:23.3531410Z E   triton.compiler.errors.CompilationError: _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:23.3532978Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:23.6996560Z E   triton.compiler.errors.CompilationError: _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
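For scale: the @given strategies draw from a fixed grid of 5 x 2 x 2 x 2 x 2 = 80 parameter combinations, and every sampled combination fails the same way before any tensor math runs. The grid can be enumerated outside Hypothesis in a few lines (a standalone sketch, not part of activation_test.py):

    import itertools

    # The same grid test_silu_mul_quant's @given strategies sample from.
    combos = list(itertools.product(
        [1, 128, 2048, 4096, 16384],  # T
        [5120, 7168],                 # D
        [None, 1200.00],              # scale_ub
        [True, False],                # contiguous
        [True, False],                # compiled
    ))
    print(len(combos))  # 80 combinations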
2025-05-07T20:32:23.6998174Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:23.7985592Z self =
2025-05-07T20:32:23.7986389Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:23.7999215Z     y_fp8, y_scale = fn()
2025-05-07T20:32:23.7999513Z     y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:23.7999821Z
2025-05-07T20:32:23.8000075Z     def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:23.8000578Z         x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:23.8000887Z         x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:23.8001222Z         y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:23.8001601Z         return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:23.8001928Z
2025-05-07T20:32:23.8002150Z >   y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:23.8002355Z
2025-05-07T20:32:23.8002468Z moe/activation_test.py:126:
2025-05-07T20:32:23.8002775Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:23.8003131Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:23.8003478Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:23.8004302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:23.8005080Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:23.8005658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:23.8006372Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:23.8007173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:23.8007942Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:23.8008726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:23.8009511Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:23.8010273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:23.8010944Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:23.8011575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:23.8018847Z     fn()
2025-05-07T20:32:23.8019411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:23.8020036Z     self.fn.run(
2025-05-07T20:32:23.8020531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:23.8021092Z     kernel = self.compile(
2025-05-07T20:32:23.8021657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:23.8022346Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:23.8022769Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:23.8023011Z
2025-05-07T20:32:23.8023235Z self =
2025-05-07T20:32:23.8024626Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:23.8026061Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f7677eaa3b0>}
2025-05-07T20:32:23.8027453Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:23.8028509Z context =
2025-05-07T20:32:23.8028808Z
2025-05-07T20:32:23.8028992Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:23.8029701Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:23.8030197Z             module_map=module_map)
2025-05-07T20:32:23.8030583Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:23.8030955Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:23.8031245Z E   ^
2025-05-07T20:32:23.8031737Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:23.8032204Z
2025-05-07T20:32:23.8032644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:23.8033176Z
2025-05-07T20:32:23.8033286Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
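The example above (T=1, D=7168, scale_ub=None) is the only one in this stretch that makes it past fn(): the failure moves into the reference path instead, because triton_quantize_fp8_row JIT-compiles its own FP8 kernel (_kernel_quantize_fp8_row), here under the autotuner, and hits the identical architecture check. Any hardware guard therefore has to cover the reference as well. If a Triton-free reference were wanted, row-wise FP8 quantization can be expressed in plain PyTorch; the sketch below is hypothetical, with scale semantics inferred from the test's dequant step (y = y_fp8.to(torch.float32) * y_scale[:, None]) rather than from FBGEMM's implementation:

    import torch

    FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

    def quantize_fp8_row_torch(
        y: torch.Tensor, scale_ub: torch.Tensor | None = None
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # Hypothetical Triton-free stand-in for triton_quantize_fp8_row:
        # one dequant scale per row, chosen so each row's max maps to FP8 max.
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
        scale = torch.clamp(row_max, min=1e-12) / FP8_E4M3_MAX
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale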
2025-05-07T20:32:23.9762788Z op = torch.compile(op) 2025-05-07T20:32:23.9763103Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.9763392Z 2025-05-07T20:32:23.9763600Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.9763780Z 2025-05-07T20:32:23.9763885Z moe/activation_test.py:117: 2025-05-07T20:32:23.9764343Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.9764692Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.9764991Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.9765585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.9766174Z return fn(*args, **kwargs) 2025-05-07T20:32:23.9766869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.9767594Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.9768158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.9768868Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.9769565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.9770128Z kernel = self.compile( 2025-05-07T20:32:23.9770693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.9771384Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.9771887Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.9772126Z 2025-05-07T20:32:23.9772344Z self = 2025-05-07T20:32:23.9773471Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.9774900Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76778d9c60>} 2025-05-07T20:32:23.9776307Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.9777432Z context = 2025-05-07T20:32:23.9777733Z 2025-05-07T20:32:23.9777917Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.9778460Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.9778962Z module_map=module_map) 2025-05-07T20:32:23.9779358Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.9779727Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.9780004Z E ^ 2025-05-07T20:32:23.9780492Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.9780964Z 2025-05-07T20:32:23.9781407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.9781942Z 2025-05-07T20:32:23.9782054Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.9782496Z self=, 2025-05-07T20:32:23.9782918Z T=1, 2025-05-07T20:32:23.9783111Z D=5120, 2025-05-07T20:32:23.9783322Z scale_ub=1200.0, 2025-05-07T20:32:23.9783561Z contiguous=False, 2025-05-07T20:32:23.9783800Z compiled=False, 2025-05-07T20:32:23.9784017Z ) 2025-05-07T20:32:23.9784358Z self = 2025-05-07T20:32:23.9784880Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:23.9785160Z 2025-05-07T20:32:23.9785242Z @given( 2025-05-07T20:32:23.9785511Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.9785928Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.9786258Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.9786608Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.9786972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.9787316Z ) 2025-05-07T20:32:23.9787683Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.9788149Z def test_silu_mul_quant( 2025-05-07T20:32:23.9788408Z self, 2025-05-07T20:32:23.9788617Z T: int, 2025-05-07T20:32:23.9788824Z D: int, 2025-05-07T20:32:23.9789059Z scale_ub: Optional[float], 2025-05-07T20:32:23.9789348Z contiguous: bool, 2025-05-07T20:32:23.9789600Z compiled: bool, 2025-05-07T20:32:23.9789839Z ) -> None: 2025-05-07T20:32:23.9790069Z torch.manual_seed(2025) 2025-05-07T20:32:23.9790320Z 2025-05-07T20:32:23.9790607Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.9790972Z 2025-05-07T20:32:23.9791173Z x_sign = torch.sign(x) 2025-05-07T20:32:23.9791487Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.9791815Z x = x_sign * x_clamp 2025-05-07T20:32:23.9792149Z x0 = x[:, :D] 2025-05-07T20:32:23.9792386Z x1 = x[:, D:] 2025-05-07T20:32:23.9792609Z 2025-05-07T20:32:23.9792803Z if contiguous: 2025-05-07T20:32:23.9793048Z x0 = x0.contiguous() 2025-05-07T20:32:23.9793322Z x1 = x1.contiguous() 2025-05-07T20:32:23.9793625Z 2025-05-07T20:32:23.9793832Z if scale_ub is not None: 2025-05-07T20:32:23.9794124Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.9794480Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.9794803Z ) 2025-05-07T20:32:23.9795008Z else: 2025-05-07T20:32:23.9795233Z scale_ub_tensor = None 2025-05-07T20:32:23.9795497Z 2025-05-07T20:32:23.9795749Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.9796080Z op = silu_mul_quant 2025-05-07T20:32:23.9796344Z if compiled: 2025-05-07T20:32:23.9796611Z op = torch.compile(op) 2025-05-07T20:32:23.9796937Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.9797226Z 2025-05-07T20:32:23.9797434Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.9797609Z 2025-05-07T20:32:23.9797719Z moe/activation_test.py:117: 2025-05-07T20:32:23.9798029Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.9798378Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.9798680Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.9799405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.9800126Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.9800695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.9801413Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.9802108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.9802671Z kernel = self.compile( 2025-05-07T20:32:23.9803243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.9803932Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.9804346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.9804589Z 2025-05-07T20:32:23.9804804Z self = 2025-05-07T20:32:23.9806020Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.9807462Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76778d9750>} 2025-05-07T20:32:23.9808863Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.9809922Z context = 2025-05-07T20:32:23.9810227Z 2025-05-07T20:32:23.9810404Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.9810950Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.9811446Z module_map=module_map) 2025-05-07T20:32:23.9811831Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.9812203Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.9812478Z E ^ 2025-05-07T20:32:23.9812960Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.9813544Z 2025-05-07T20:32:23.9813979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.9814519Z 2025-05-07T20:32:23.9814632Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.9815061Z self=, 2025-05-07T20:32:23.9815483Z T=16384, 2025-05-07T20:32:23.9815686Z D=5120, 2025-05-07T20:32:23.9815889Z scale_ub=1200.0, 2025-05-07T20:32:23.9816121Z contiguous=False, 2025-05-07T20:32:23.9816369Z compiled=True, 2025-05-07T20:32:23.9816589Z ) 2025-05-07T20:32:24.0839471Z self = 2025-05-07T20:32:24.0840248Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:24.0840692Z 2025-05-07T20:32:24.0840813Z @given( 2025-05-07T20:32:24.0841200Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.0841650Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.0842106Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.0842462Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.0842806Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.0843116Z ) 2025-05-07T20:32:24.0843487Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.0843956Z def test_silu_mul_quant( 2025-05-07T20:32:24.0844209Z self, 2025-05-07T20:32:24.0844421Z T: int, 2025-05-07T20:32:24.0844635Z D: int, 2025-05-07T20:32:24.0844864Z scale_ub: Optional[float], 2025-05-07T20:32:24.0845157Z contiguous: bool, 2025-05-07T20:32:24.0845419Z compiled: bool, 2025-05-07T20:32:24.0845654Z ) -> None: 2025-05-07T20:32:24.0845887Z torch.manual_seed(2025) 2025-05-07T20:32:24.0846156Z 2025-05-07T20:32:24.0846439Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.0846801Z 2025-05-07T20:32:24.0847011Z x_sign = torch.sign(x) 2025-05-07T20:32:24.0847315Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.0847649Z x = x_sign * x_clamp 2025-05-07T20:32:24.0847907Z x0 = x[:, :D] 2025-05-07T20:32:24.0848133Z x1 = x[:, D:] 2025-05-07T20:32:24.0848357Z 2025-05-07T20:32:24.0848557Z if contiguous: 2025-05-07T20:32:24.0848803Z x0 = x0.contiguous() 2025-05-07T20:32:24.0849078Z x1 = x1.contiguous() 2025-05-07T20:32:24.0849340Z 2025-05-07T20:32:24.0849542Z if scale_ub is not None: 2025-05-07T20:32:24.0850000Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.0850364Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.0850692Z ) 2025-05-07T20:32:24.0850895Z else: 2025-05-07T20:32:24.0851126Z scale_ub_tensor = None 2025-05-07T20:32:24.0851422Z 2025-05-07T20:32:24.0851672Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.0852014Z op = silu_mul_quant 2025-05-07T20:32:24.0852278Z if compiled: 2025-05-07T20:32:24.0852547Z op = torch.compile(op) 2025-05-07T20:32:24.0852863Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.0853156Z 2025-05-07T20:32:24.0853365Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.0853540Z 2025-05-07T20:32:24.0853657Z moe/activation_test.py:117: 2025-05-07T20:32:24.0853970Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.0854323Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.0854638Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.0855222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:24.0855811Z return fn(*args, **kwargs) 
2025-05-07T20:32:24.0856633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.0857358Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.0857919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.0858637Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.0859336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.0859898Z kernel = self.compile( 2025-05-07T20:32:24.0860466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.0861152Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.0861573Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.0861820Z 2025-05-07T20:32:24.0862035Z self = 2025-05-07T20:32:24.0863164Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.0864597Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76778d96c0>} 2025-05-07T20:32:24.0866003Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.0867075Z context = 2025-05-07T20:32:24.0867375Z 2025-05-07T20:32:24.0867560Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.0868116Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.0868609Z module_map=module_map) 2025-05-07T20:32:24.0868998Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.0869367Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.0869648Z E ^ 2025-05-07T20:32:24.0870140Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.0870609Z 2025-05-07T20:32:24.0871546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.0872093Z 2025-05-07T20:32:24.0872206Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.0872644Z self=, 2025-05-07T20:32:24.0873073Z T=2048, 2025-05-07T20:32:24.0873279Z D=7168, 2025-05-07T20:32:24.0873489Z scale_ub=1200.0, 2025-05-07T20:32:24.0873835Z contiguous=False, 2025-05-07T20:32:24.0874072Z compiled=True, 2025-05-07T20:32:24.0874292Z ) 2025-05-07T20:32:24.0874631Z self = 2025-05-07T20:32:24.0875147Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:24.0875441Z 2025-05-07T20:32:24.0875527Z @given( 2025-05-07T20:32:24.0875779Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.0876109Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.0876437Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.0876805Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.0877162Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.0877463Z ) 2025-05-07T20:32:24.0877841Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.0878401Z def test_silu_mul_quant( 2025-05-07T20:32:24.0878660Z self, 2025-05-07T20:32:24.0878870Z T: int, 2025-05-07T20:32:24.0879090Z D: int, 2025-05-07T20:32:24.0879325Z scale_ub: Optional[float], 2025-05-07T20:32:24.0879618Z contiguous: bool, 2025-05-07T20:32:24.0879879Z compiled: bool, 2025-05-07T20:32:24.0880115Z ) -> None: 2025-05-07T20:32:24.0880350Z torch.manual_seed(2025) 2025-05-07T20:32:24.0880609Z 2025-05-07T20:32:24.0880897Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.0881263Z 2025-05-07T20:32:24.0881469Z x_sign = torch.sign(x) 2025-05-07T20:32:24.0881785Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.0882111Z x = x_sign * x_clamp 2025-05-07T20:32:24.0882369Z x0 = x[:, :D] 2025-05-07T20:32:24.0882602Z x1 = x[:, D:] 2025-05-07T20:32:24.0882829Z 2025-05-07T20:32:24.0883029Z if contiguous: 2025-05-07T20:32:24.0883281Z x0 = x0.contiguous() 2025-05-07T20:32:24.0883552Z x1 = x1.contiguous() 2025-05-07T20:32:24.0883811Z 2025-05-07T20:32:24.0884018Z if scale_ub is not None: 2025-05-07T20:32:24.0884305Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.0884665Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.0884995Z ) 2025-05-07T20:32:24.0885201Z else: 2025-05-07T20:32:24.0885434Z scale_ub_tensor = None 2025-05-07T20:32:24.0885708Z 2025-05-07T20:32:24.0885950Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.0886284Z op = silu_mul_quant 2025-05-07T20:32:24.0886557Z if compiled: 2025-05-07T20:32:24.0886818Z op = torch.compile(op) 2025-05-07T20:32:24.0887137Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.0887432Z 2025-05-07T20:32:24.0887648Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.0887823Z 2025-05-07T20:32:24.0887929Z moe/activation_test.py:117: 2025-05-07T20:32:24.0888247Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.0888598Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.0888898Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.0889485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:24.0890077Z return fn(*args, **kwargs) 
2025-05-07T20:32:24.0890773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.0891582Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.0892152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.0892872Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.0893567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.0894132Z kernel = self.compile( 2025-05-07T20:32:24.0894705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.0895397Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.0895815Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.0896063Z 2025-05-07T20:32:24.0896281Z self = 2025-05-07T20:32:24.0897466Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.0898895Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76778d9d80>} 2025-05-07T20:32:24.0900367Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.0901439Z context = 2025-05-07T20:32:24.0901745Z 2025-05-07T20:32:24.0901922Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.0902473Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.0902969Z module_map=module_map) 2025-05-07T20:32:24.0903355Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.0903729Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.0904014Z E ^ 2025-05-07T20:32:24.0904499Z E ValueError("type fp8e4nv not supported in this architecture. 
Hypothesis then tries the remaining examples. Each one re-prints the same test body and fails with the identical traceback and CompilationError shown above, so only the sampled parameters are listed here:

Trying example: test_silu_mul_quant(T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
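Note that the compiled=True and compiled=False examples fail the same way. torch.compile only contributes the extra torch/_dynamo/eval_frame.py frame visible in the compiled tracebacks; the Triton kernel inside silu_mul_quant is JIT-compiled at first call on either path, and that is where the error originates. A sketch of the distinction, with the import path inferred from the traceback above:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    eager_op = silu_mul_quant
    compiled_op = torch.compile(silu_mul_quant)
    # Dynamo wraps the Python-level call, but _fbgemm_silu_mul_quant[grid](...)
    # is still compiled by Triton when either op first runs, so both raise the
    # same CompilationError on a pre-SM-8.9 GPU.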
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.9425214Z 2025-05-07T20:32:24.9425661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.9426208Z 2025-05-07T20:32:24.9426327Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9426770Z self=, 2025-05-07T20:32:24.9427202Z T=1, 2025-05-07T20:32:24.9427435Z D=7168, 2025-05-07T20:32:24.9427658Z scale_ub=None, 2025-05-07T20:32:24.9427891Z contiguous=True, 2025-05-07T20:32:24.9428138Z compiled=False, 2025-05-07T20:32:24.9428361Z ) 2025-05-07T20:32:24.9428854Z self = 2025-05-07T20:32:24.9429392Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:24.9429675Z 2025-05-07T20:32:24.9429775Z @given( 2025-05-07T20:32:24.9430031Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.9430376Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.9430714Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.9431073Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.9431435Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.9431752Z ) 2025-05-07T20:32:24.9432132Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.9432614Z def test_silu_mul_quant( 2025-05-07T20:32:24.9432884Z self, 2025-05-07T20:32:24.9433095Z T: int, 2025-05-07T20:32:24.9433318Z D: int, 2025-05-07T20:32:24.9433639Z scale_ub: Optional[float], 2025-05-07T20:32:24.9433936Z contiguous: bool, 2025-05-07T20:32:24.9434203Z compiled: bool, 2025-05-07T20:32:24.9434453Z ) -> None: 2025-05-07T20:32:24.9434693Z torch.manual_seed(2025) 2025-05-07T20:32:24.9435090Z 2025-05-07T20:32:24.9435392Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.9435767Z 2025-05-07T20:32:24.9435977Z x_sign = torch.sign(x) 2025-05-07T20:32:24.9436297Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.9436639Z x = x_sign * x_clamp 2025-05-07T20:32:24.9436898Z x0 = x[:, :D] 2025-05-07T20:32:24.9437140Z x1 = x[:, D:] 2025-05-07T20:32:24.9437402Z 2025-05-07T20:32:24.9437619Z if contiguous: 2025-05-07T20:32:24.9437876Z x0 = x0.contiguous() 2025-05-07T20:32:24.9438163Z x1 = x1.contiguous() 2025-05-07T20:32:24.9438425Z 2025-05-07T20:32:24.9438641Z if scale_ub is not None: 2025-05-07T20:32:24.9438950Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.9439314Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.9439654Z ) 2025-05-07T20:32:24.9439875Z else: 2025-05-07T20:32:24.9440110Z scale_ub_tensor = None 2025-05-07T20:32:24.9440389Z 2025-05-07T20:32:24.9440644Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.9440994Z op = silu_mul_quant 2025-05-07T20:32:24.9441267Z if compiled: 2025-05-07T20:32:24.9441543Z op = torch.compile(op) 2025-05-07T20:32:24.9441871Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.9442170Z 2025-05-07T20:32:24.9442386Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.9442567Z 2025-05-07T20:32:24.9442684Z moe/activation_test.py:117: 2025-05-07T20:32:24.9443006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.9443373Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.9443685Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.9444437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.9445186Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.9445770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.9446517Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.9447231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.9447811Z kernel = self.compile( 2025-05-07T20:32:24.9448400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.9449114Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.9449629Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.9449884Z 2025-05-07T20:32:24.9450106Z self = 2025-05-07T20:32:24.9451275Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.9452751Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76769884c0>} 2025-05-07T20:32:24.9454192Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.9455297Z context = 2025-05-07T20:32:24.9455618Z 2025-05-07T20:32:24.9455804Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.9456373Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.9456963Z module_map=module_map) 2025-05-07T20:32:24.9457365Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.9457751Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.9458041Z E ^ 2025-05-07T20:32:24.9458544Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.9459035Z 2025-05-07T20:32:24.9459484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.9460037Z 2025-05-07T20:32:24.9460158Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9460612Z self=, 2025-05-07T20:32:24.9461051Z T=16384, 2025-05-07T20:32:24.9461266Z D=7168, 2025-05-07T20:32:24.9461481Z scale_ub=1200.0, 2025-05-07T20:32:24.9461728Z contiguous=False, 2025-05-07T20:32:24.9461987Z compiled=True, 2025-05-07T20:32:25.2226671Z ) 2025-05-07T20:32:25.2227320Z self = 2025-05-07T20:32:25.2228149Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:25.2228563Z 2025-05-07T20:32:25.2228683Z @given( 2025-05-07T20:32:25.2229041Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2229502Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2229981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2230336Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2230681Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2230991Z ) 2025-05-07T20:32:25.2231375Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2231838Z def test_silu_mul_quant( 2025-05-07T20:32:25.2232098Z self, 2025-05-07T20:32:25.2232314Z T: int, 2025-05-07T20:32:25.2232522Z D: int, 2025-05-07T20:32:25.2232753Z scale_ub: Optional[float], 2025-05-07T20:32:25.2233043Z contiguous: bool, 2025-05-07T20:32:25.2233301Z compiled: bool, 2025-05-07T20:32:25.2233629Z ) -> None: 2025-05-07T20:32:25.2233862Z torch.manual_seed(2025) 2025-05-07T20:32:25.2234120Z 2025-05-07T20:32:25.2234406Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2234765Z 2025-05-07T20:32:25.2234970Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2235273Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2235601Z x = x_sign * x_clamp 2025-05-07T20:32:25.2235863Z x0 = x[:, :D] 2025-05-07T20:32:25.2236277Z x1 = x[:, D:] 2025-05-07T20:32:25.2236503Z 2025-05-07T20:32:25.2236701Z if contiguous: 2025-05-07T20:32:25.2236943Z x0 = x0.contiguous() 2025-05-07T20:32:25.2237218Z x1 = x1.contiguous() 2025-05-07T20:32:25.2237478Z 2025-05-07T20:32:25.2237680Z if scale_ub is not None: 2025-05-07T20:32:25.2237971Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2238324Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2238651Z ) 2025-05-07T20:32:25.2238853Z else: 2025-05-07T20:32:25.2239078Z scale_ub_tensor = None 2025-05-07T20:32:25.2239345Z 2025-05-07T20:32:25.2239585Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2239921Z op = silu_mul_quant 2025-05-07T20:32:25.2240191Z if compiled: 2025-05-07T20:32:25.2240449Z op = torch.compile(op) 2025-05-07T20:32:25.2240763Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2241066Z 2025-05-07T20:32:25.2241265Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2241446Z 2025-05-07T20:32:25.2241552Z moe/activation_test.py:117: 2025-05-07T20:32:25.2241868Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2242339Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2242641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2243234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2243827Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2244518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2245244Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2245814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2246537Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2247237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2247807Z kernel = self.compile( 2025-05-07T20:32:25.2248382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2249069Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2249493Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2249737Z 2025-05-07T20:32:25.2249954Z self = 2025-05-07T20:32:25.2251090Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2252529Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76769895a0>} 2025-05-07T20:32:25.2253939Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2255017Z context = 2025-05-07T20:32:25.2255317Z 2025-05-07T20:32:25.2255500Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2256049Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2256543Z module_map=module_map) 2025-05-07T20:32:25.2256926Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2257401Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2257701Z E ^ 2025-05-07T20:32:25.2258187Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2258664Z 2025-05-07T20:32:25.2259105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2259641Z 2025-05-07T20:32:25.2259759Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2260188Z self=, 2025-05-07T20:32:25.2260615Z T=1, 2025-05-07T20:32:25.2260819Z D=7168, 2025-05-07T20:32:25.2261020Z scale_ub=None, 2025-05-07T20:32:25.2261254Z contiguous=False, 2025-05-07T20:32:25.2261495Z compiled=False, 2025-05-07T20:32:25.2261709Z ) 2025-05-07T20:32:25.2262049Z self = 2025-05-07T20:32:25.2262579Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.2262855Z 2025-05-07T20:32:25.2262940Z @given( 2025-05-07T20:32:25.2263188Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2263520Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2263931Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2264274Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2264622Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2264924Z ) 2025-05-07T20:32:25.2265291Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2265760Z def test_silu_mul_quant( 2025-05-07T20:32:25.2266018Z self, 2025-05-07T20:32:25.2266219Z T: int, 2025-05-07T20:32:25.2266431Z D: int, 2025-05-07T20:32:25.2266663Z scale_ub: Optional[float], 2025-05-07T20:32:25.2266952Z contiguous: bool, 2025-05-07T20:32:25.2267207Z compiled: bool, 2025-05-07T20:32:25.2267478Z ) -> None: 2025-05-07T20:32:25.2267726Z torch.manual_seed(2025) 2025-05-07T20:32:25.2274990Z 2025-05-07T20:32:25.2275314Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2275697Z 2025-05-07T20:32:25.2275901Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2276213Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2276544Z x = x_sign * x_clamp 2025-05-07T20:32:25.2276795Z x0 = x[:, :D] 2025-05-07T20:32:25.2277023Z x1 = x[:, D:] 2025-05-07T20:32:25.2277248Z 2025-05-07T20:32:25.2277444Z if contiguous: 2025-05-07T20:32:25.2277692Z x0 = x0.contiguous() 2025-05-07T20:32:25.2277965Z x1 = x1.contiguous() 2025-05-07T20:32:25.2278217Z 2025-05-07T20:32:25.2278414Z if scale_ub is not None: 2025-05-07T20:32:25.2278701Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2279061Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2279395Z ) 2025-05-07T20:32:25.2279601Z else: 2025-05-07T20:32:25.2279827Z scale_ub_tensor = None 2025-05-07T20:32:25.2280090Z 2025-05-07T20:32:25.2280344Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2280677Z op = silu_mul_quant 2025-05-07T20:32:25.2280938Z if compiled: 2025-05-07T20:32:25.2281204Z op = torch.compile(op) 2025-05-07T20:32:25.2281517Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2281805Z 2025-05-07T20:32:25.2282015Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2282189Z 2025-05-07T20:32:25.2282303Z moe/activation_test.py:117: 2025-05-07T20:32:25.2282614Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2282969Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2283269Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2284112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2284836Z 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f7676989d80>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f767698af80>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
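Every failure in this run is the same compile-time rejection: the Triton CUDA backend refuses the fp8e4nv element type (the e4m3 format behind torch.float8_e4m3fn), and the error's list of supported types, fp8e4b15 and fp8e5 only, indicates a GPU with compute capability below 8.9 (Ada/Hopper), where Triton first accepts fp8e4nv. Under that assumption, a guard of the following shape would let the suite skip instead of fail on such machines. This is a hypothetical sketch; supports_fp8e4nv and requires_fp8e4nv are illustrative names, not part of the test file above.

import pytest
import torch

# Hypothetical guard, not taken from the FBGEMM sources. Triton's fp8e4nv
# maps to torch.float8_e4m3fn and is assumed to compile only on GPUs with
# compute capability >= (8, 9); older parts raise the CompilationError
# seen in this log.
def supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Decorating fp8 tests with this marker turns FAILED into SKIPPED on
# unsupported runners.
requires_fp8e4nv = pytest.mark.skipif(
    not supports_fp8e4nv(), reason="GPU lacks fp8e4nv support (needs SM 8.9+)"
)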
[Hypothesis keeps drawing examples. Each of the following draws fails in _fbgemm_silu_mul_quant with a traceback and CompilationError identical to the one above, so only the drawn parameters are kept:]

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
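The rejection is independent of the FBGEMM kernel; any Triton program that produces an fp8e4nv value should trip the same check at compile time. A minimal repro sketch, assuming a Triton build that exposes tl.float8e4nv and a PyTorch build with torch.float8_e4m3fn (the kernel is illustrative, not the FBGEMM one):

import torch
import triton
import triton.language as tl

@triton.jit
def _fp8_cast_kernel(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
    # Load a block of fp32 values and store them as fp8e4nv. On a GPU whose
    # backend lacks fp8e4nv, compilation aborts with the ValueError repeated
    # throughout this log.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

x = torch.randn(1024, device="cuda")
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
# Expected to raise triton.compiler.errors.CompilationError on pre-SM-8.9 GPUs.
_fp8_cast_kernel[(1,)](x, y, 1024, BLOCK=1024)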
[Three further draws fail identically:]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
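The operator under test is visible here only through its call site, op(x0, x1, scale_ub_tensor), which returns a (y_fp8, y_scale) pair. Assuming the SwiGLU-style fused activation and rowwise fp8 quantization that the name suggests, an eager-mode reference might look like the sketch below. This is a plausible reading for orientation, not the FBGEMM implementation:

from typing import Optional, Tuple

import torch

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Compute silu(x0) * x1 in fp32 for accuracy, then quantize each row
    # into the fp8 e4m3 range with a per-row scale.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    scale = row_max / fp8_max
    y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(-1)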
[Two more draws fail identically:]

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
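Since verbose Hypothesis output like this can bury the signal, a common follow-up once a failing draw is known is to pin it with @example so it always runs, independent of the random search. A self-contained sketch of the pattern over the same parameter space (test_params_smoke and its assertion are stand-ins; a real regression test would build the tensors and call silu_mul_quant as above):

from typing import Optional

from hypothesis import example, given, settings
from hypothesis import strategies as st

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@example(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
@settings(max_examples=10, deadline=None)
def test_params_smoke(
    T: int,
    D: int,
    scale_ub: Optional[float],
    contiguous: bool,
    compiled: bool,
) -> None:
    # Stand-in check; the real test would exercise the kernel here.
    assert T >= 1 and D in (5120, 7168)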
[The remaining draws in this excerpt fail the same way:]

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)

[The last draw's traceback, identical to the ones above, ends in:]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:26.7774320Z 2025-05-07T20:32:26.7774793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:26.7775368Z 2025-05-07T20:32:26.9906810Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:26.9908127Z self=, 2025-05-07T20:32:26.9908904Z T=4096, 2025-05-07T20:32:26.9909232Z D=7168, 2025-05-07T20:32:26.9909537Z scale_ub=1200.0, 2025-05-07T20:32:26.9909812Z contiguous=False, 2025-05-07T20:32:26.9910082Z compiled=True, 2025-05-07T20:32:26.9910321Z ) 2025-05-07T20:32:26.9910711Z self = 2025-05-07T20:32:26.9911293Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:26.9911836Z 2025-05-07T20:32:26.9911930Z @given( 2025-05-07T20:32:26.9912205Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:26.9912574Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:26.9912929Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:26.9913318Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:26.9913777Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:26.9914113Z ) 2025-05-07T20:32:26.9914519Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:26.9915039Z def test_silu_mul_quant( 2025-05-07T20:32:26.9915335Z self, 2025-05-07T20:32:26.9915562Z T: int, 2025-05-07T20:32:26.9915799Z D: int, 2025-05-07T20:32:26.9916066Z scale_ub: Optional[float], 2025-05-07T20:32:26.9916381Z contiguous: bool, 2025-05-07T20:32:26.9916669Z compiled: bool, 2025-05-07T20:32:26.9916936Z ) -> None: 2025-05-07T20:32:26.9917192Z torch.manual_seed(2025) 2025-05-07T20:32:26.9917480Z 2025-05-07T20:32:26.9917807Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:26.9918203Z 2025-05-07T20:32:26.9918437Z x_sign = torch.sign(x) 2025-05-07T20:32:26.9918778Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:26.9919145Z x = x_sign * x_clamp 2025-05-07T20:32:26.9919421Z x0 = x[:, :D] 2025-05-07T20:32:26.9919678Z x1 = x[:, D:] 2025-05-07T20:32:26.9919927Z 2025-05-07T20:32:26.9920149Z if contiguous: 2025-05-07T20:32:26.9920423Z x0 = x0.contiguous() 2025-05-07T20:32:26.9920730Z x1 = x1.contiguous() 2025-05-07T20:32:26.9921006Z 2025-05-07T20:32:26.9921233Z if scale_ub is not None: 2025-05-07T20:32:26.9921553Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:26.9921938Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:26.9922295Z ) 2025-05-07T20:32:26.9922534Z else: 2025-05-07T20:32:26.9922776Z scale_ub_tensor = None 2025-05-07T20:32:26.9923071Z 2025-05-07T20:32:26.9923346Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:26.9923705Z op = silu_mul_quant 2025-05-07T20:32:26.9924371Z if compiled: 2025-05-07T20:32:26.9924666Z op = torch.compile(op) 2025-05-07T20:32:26.9925009Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:26.9925335Z 2025-05-07T20:32:26.9925563Z > y_fp8, y_scale = fn() 2025-05-07T20:32:26.9925755Z 2025-05-07T20:32:26.9925879Z moe/activation_test.py:117: 2025-05-07T20:32:26.9926222Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:26.9933283Z moe/activation_test.py:115: in fn 2025-05-07T20:32:26.9933640Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:26.9934295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:26.9934950Z return fn(*args, **kwargs) 
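Note: every CompilationError in this run has the same root cause. The Triton kernel behind silu_mul_quant stores its output as fp8e4nv (torch.float8_e4m3fn), and the g5 runner's NVIDIA A10G is compute capability 8.6, while Triton lowers fp8e4nv natively only on sm_89 and newer (Ada/Hopper); older parts expose only ('fp8e4b15', 'fp8e5'), exactly as the ValueError reports. A minimal sketch of a capability guard follows; the (8, 9) threshold and the decorator name are illustrative assumptions, not part of the FBGEMM test suite.

import unittest

import torch

def gpu_supports_fp8e4nv() -> bool:
    # fp8e4nv corresponds to torch.float8_e4m3fn; Triton emits it natively
    # only on compute capability >= 8.9 (assumed threshold). The A10G here
    # is sm_86, hence the repeated CompilationError above.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical guard the FP8 quantization tests could carry:
skip_unless_fp8e4nv = unittest.skipUnless(
    gpu_supports_fp8e4nv(), "Triton fp8e4nv is unsupported on this GPU"
)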
2025-05-07T20:32:26.9935717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:26.9936511Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:26.9937124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:26.9937905Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:26.9938667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:26.9939273Z kernel = self.compile( 2025-05-07T20:32:26.9939916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:26.9940671Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:26.9941264Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:26.9941533Z 2025-05-07T20:32:26.9941775Z self = 2025-05-07T20:32:26.9943347Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:26.9944915Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76762d4550>} 2025-05-07T20:32:26.9946444Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:26.9947604Z context = 2025-05-07T20:32:26.9947943Z 2025-05-07T20:32:26.9948137Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:26.9948732Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:26.9949271Z module_map=module_map) 2025-05-07T20:32:26.9949686Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:26.9950091Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:26.9950393Z E ^ 2025-05-07T20:32:26.9950925Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:26.9951444Z 2025-05-07T20:32:26.9951921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:26.9952512Z 2025-05-07T20:32:26.9952632Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:26.9953111Z self=, 2025-05-07T20:32:26.9953665Z T=128, 2025-05-07T20:32:26.9953883Z D=7168, 2025-05-07T20:32:26.9954113Z scale_ub=1200.0, 2025-05-07T20:32:26.9954366Z contiguous=False, 2025-05-07T20:32:26.9954633Z compiled=True, 2025-05-07T20:32:26.9954868Z ) 2025-05-07T20:32:27.1088563Z self = 2025-05-07T20:32:27.1089399Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:27.1089834Z 2025-05-07T20:32:27.1089963Z @given( 2025-05-07T20:32:27.1090338Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.1090852Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.1091541Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.1091986Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.1092365Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.1092698Z ) 2025-05-07T20:32:27.1093096Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.1093597Z def test_silu_mul_quant( 2025-05-07T20:32:27.1093876Z self, 2025-05-07T20:32:27.1094098Z T: int, 2025-05-07T20:32:27.1094324Z D: int, 2025-05-07T20:32:27.1094575Z scale_ub: Optional[float], 2025-05-07T20:32:27.1094879Z contiguous: bool, 2025-05-07T20:32:27.1095159Z compiled: bool, 2025-05-07T20:32:27.1095422Z ) -> None: 2025-05-07T20:32:27.1095663Z torch.manual_seed(2025) 2025-05-07T20:32:27.1095940Z 2025-05-07T20:32:27.1096255Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.1096640Z 2025-05-07T20:32:27.1096867Z x_sign = torch.sign(x) 2025-05-07T20:32:27.1097200Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.1097555Z x = x_sign * x_clamp 2025-05-07T20:32:27.1097842Z x0 = x[:, :D] 2025-05-07T20:32:27.1098274Z x1 = x[:, D:] 2025-05-07T20:32:27.1098514Z 2025-05-07T20:32:27.1098723Z if contiguous: 2025-05-07T20:32:27.1098991Z x0 = x0.contiguous() 2025-05-07T20:32:27.1099287Z x1 = x1.contiguous() 2025-05-07T20:32:27.1099557Z 2025-05-07T20:32:27.1099778Z if scale_ub is not None: 2025-05-07T20:32:27.1100090Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:27.1100469Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:27.1100820Z ) 2025-05-07T20:32:27.1101040Z else: 2025-05-07T20:32:27.1101275Z scale_ub_tensor = None 2025-05-07T20:32:27.1101562Z 2025-05-07T20:32:27.1101831Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.1102186Z op = silu_mul_quant 2025-05-07T20:32:27.1102475Z if compiled: 2025-05-07T20:32:27.1102760Z op = torch.compile(op) 2025-05-07T20:32:27.1103101Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.1103418Z 2025-05-07T20:32:27.1103641Z > y_fp8, y_scale = fn() 2025-05-07T20:32:27.1103830Z 2025-05-07T20:32:27.1103950Z moe/activation_test.py:117: 2025-05-07T20:32:27.1104284Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.1104664Z moe/activation_test.py:115: in fn 2025-05-07T20:32:27.1104992Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.1105627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:27.1106266Z return fn(*args, **kwargs) 
2025-05-07T20:32:27.1107019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:27.1107805Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:27.1108409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:27.1109193Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:27.1109947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:27.1110548Z kernel = self.compile( 2025-05-07T20:32:27.1111165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:27.1111912Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:27.1112359Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.1112620Z 2025-05-07T20:32:27.1112852Z self = 2025-05-07T20:32:27.1114267Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:27.1115856Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76762d4f70>} 2025-05-07T20:32:27.1117388Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:27.1118556Z context = 2025-05-07T20:32:27.1118887Z 2025-05-07T20:32:27.1119080Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:27.1119682Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:27.1120221Z module_map=module_map) 2025-05-07T20:32:27.1120636Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:27.1121039Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:27.1121462Z E ^ 2025-05-07T20:32:27.1121997Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:27.1122512Z 2025-05-07T20:32:27.1122990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:27.1123577Z 2025-05-07T20:32:27.1123699Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.1124557Z self=, 2025-05-07T20:32:27.1125018Z T=2048, 2025-05-07T20:32:27.1125230Z D=7168, 2025-05-07T20:32:27.1125453Z scale_ub=None, 2025-05-07T20:32:27.1125706Z contiguous=True, 2025-05-07T20:32:27.1125976Z compiled=True, 2025-05-07T20:32:27.1126204Z ) 2025-05-07T20:32:27.1126571Z self = 2025-05-07T20:32:27.1127131Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:27.1127441Z 2025-05-07T20:32:27.1127530Z @given( 2025-05-07T20:32:27.1127801Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.1128159Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.1128509Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.1128887Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.1129266Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.1129591Z ) 2025-05-07T20:32:27.1129995Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.1130502Z def test_silu_mul_quant( 2025-05-07T20:32:27.1130782Z self, 2025-05-07T20:32:27.1131005Z T: int, 2025-05-07T20:32:27.1131238Z D: int, 2025-05-07T20:32:27.1131494Z scale_ub: Optional[float], 2025-05-07T20:32:27.1131802Z contiguous: bool, 2025-05-07T20:32:27.1132081Z compiled: bool, 2025-05-07T20:32:27.1132345Z ) -> None: 2025-05-07T20:32:27.1132589Z torch.manual_seed(2025) 2025-05-07T20:32:27.1132866Z 2025-05-07T20:32:27.1133180Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.1133565Z 2025-05-07T20:32:27.1133790Z x_sign = torch.sign(x) 2025-05-07T20:32:27.1134125Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.1134481Z x = x_sign * x_clamp 2025-05-07T20:32:27.1134764Z x0 = x[:, :D] 2025-05-07T20:32:27.1135015Z x1 = x[:, D:] 2025-05-07T20:32:27.1135252Z 2025-05-07T20:32:27.1135469Z if contiguous: 2025-05-07T20:32:27.1135739Z x0 = x0.contiguous() 2025-05-07T20:32:27.1136036Z x1 = x1.contiguous() 2025-05-07T20:32:27.1136461Z 2025-05-07T20:32:27.1136696Z if scale_ub is not None: 2025-05-07T20:32:27.1137012Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:27.1137392Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:27.1137749Z ) 2025-05-07T20:32:27.1137973Z else: 2025-05-07T20:32:27.1138212Z scale_ub_tensor = None 2025-05-07T20:32:27.1138501Z 2025-05-07T20:32:27.1138768Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.1139122Z op = silu_mul_quant 2025-05-07T20:32:27.1139414Z if compiled: 2025-05-07T20:32:27.1139702Z op = torch.compile(op) 2025-05-07T20:32:27.1140038Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.1140355Z 2025-05-07T20:32:27.1140578Z > y_fp8, y_scale = fn() 2025-05-07T20:32:27.1140768Z 2025-05-07T20:32:27.1140884Z moe/activation_test.py:117: 2025-05-07T20:32:27.1141234Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.1141615Z moe/activation_test.py:115: in fn 2025-05-07T20:32:27.1141939Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.1142570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:27.1143374Z return fn(*args, **kwargs) 
2025-05-07T20:32:27.1144125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:27.1144903Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:27.1145518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:27.1146299Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:27.1147056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:27.1147668Z kernel = self.compile( 2025-05-07T20:32:27.1148336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:27.1149092Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:27.1149556Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.1149821Z 2025-05-07T20:32:27.1150058Z self = 2025-05-07T20:32:27.1151281Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:27.1152836Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76762d5bd0>} 2025-05-07T20:32:27.1154496Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:27.1155797Z context = 2025-05-07T20:32:27.1156130Z 2025-05-07T20:32:27.1156323Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:27.1156922Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:27.1157458Z module_map=module_map) 2025-05-07T20:32:27.1157877Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:27.1158337Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:27.1158642Z E ^ 2025-05-07T20:32:27.1159171Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:27.1159690Z 2025-05-07T20:32:27.1160267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:27.1160858Z 2025-05-07T20:32:27.1996562Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.1997125Z self=, 2025-05-07T20:32:27.1997799Z T=16384, 2025-05-07T20:32:27.1998100Z D=5120, 2025-05-07T20:32:27.1998380Z scale_ub=None, 2025-05-07T20:32:27.1998676Z contiguous=False, 2025-05-07T20:32:27.1998937Z compiled=False, 2025-05-07T20:32:27.1999177Z ) 2025-05-07T20:32:27.1999550Z self = 2025-05-07T20:32:27.2000124Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:27.2000453Z 2025-05-07T20:32:27.2000544Z @given( 2025-05-07T20:32:27.2000818Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.2001186Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.2001545Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.2001928Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.2002310Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.2002856Z ) 2025-05-07T20:32:27.2003265Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.2003778Z def test_silu_mul_quant( 2025-05-07T20:32:27.2004054Z self, 2025-05-07T20:32:27.2004287Z T: int, 2025-05-07T20:32:27.2004520Z D: int, 2025-05-07T20:32:27.2004772Z scale_ub: Optional[float], 2025-05-07T20:32:27.2005091Z contiguous: bool, 2025-05-07T20:32:27.2005371Z compiled: bool, 2025-05-07T20:32:27.2005630Z ) -> None: 2025-05-07T20:32:27.2005886Z torch.manual_seed(2025) 2025-05-07T20:32:27.2006167Z 2025-05-07T20:32:27.2006477Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.2006879Z 2025-05-07T20:32:27.2007110Z x_sign = torch.sign(x) 2025-05-07T20:32:27.2007453Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.2009780Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:27.2011946Z 2025-05-07T20:32:27.2012087Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:27.2012359Z 2025-05-07T20:32:27.2012487Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.2012961Z self=, 2025-05-07T20:32:27.2013425Z T=4096, 2025-05-07T20:32:27.2013645Z D=7168, 2025-05-07T20:32:27.2013865Z scale_ub=1200.0, 2025-05-07T20:32:27.2014126Z contiguous=True, 2025-05-07T20:32:27.2014391Z compiled=True, 2025-05-07T20:32:27.2014621Z ) 2025-05-07T20:32:27.2014985Z self = 2025-05-07T20:32:27.2015547Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:27.2015854Z 2025-05-07T20:32:27.2015953Z @given( 2025-05-07T20:32:27.2016210Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.2016571Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.2016925Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.2017297Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.2017676Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.2018141Z ) 2025-05-07T20:32:27.2018548Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.2019055Z def test_silu_mul_quant( 2025-05-07T20:32:27.2019334Z self, 2025-05-07T20:32:27.2019559Z T: int, 2025-05-07T20:32:27.2019791Z D: int, 2025-05-07T20:32:27.2020043Z scale_ub: Optional[float], 2025-05-07T20:32:27.2020358Z contiguous: bool, 2025-05-07T20:32:27.2020630Z compiled: bool, 2025-05-07T20:32:27.2020890Z ) -> None: 2025-05-07T20:32:27.2021142Z torch.manual_seed(2025) 2025-05-07T20:32:27.2021414Z 2025-05-07T20:32:27.2021725Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.2022117Z 2025-05-07T20:32:27.2022332Z x_sign = torch.sign(x) 2025-05-07T20:32:27.2022665Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.2025307Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:27.2027602Z 2025-05-07T20:32:27.2027747Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:27.2027991Z 2025-05-07T20:32:27.2028119Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.2028593Z self=, 2025-05-07T20:32:27.2029053Z T=16384, 2025-05-07T20:32:27.2029277Z D=7168, 2025-05-07T20:32:27.2029492Z scale_ub=None, 2025-05-07T20:32:27.2029744Z contiguous=False, 2025-05-07T20:32:27.2030007Z compiled=False, 2025-05-07T20:32:27.2030241Z ) 2025-05-07T20:32:27.2030607Z self = 2025-05-07T20:32:27.2031180Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:27.2031504Z 2025-05-07T20:32:27.2031595Z @given( 2025-05-07T20:32:27.2031858Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.2032244Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.2032601Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.2032974Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.2033356Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.2033784Z ) 2025-05-07T20:32:27.2034185Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.2034696Z def test_silu_mul_quant( 2025-05-07T20:32:27.2034976Z self, 2025-05-07T20:32:27.2035193Z T: int, 2025-05-07T20:32:27.2035427Z D: int, 2025-05-07T20:32:27.2035681Z scale_ub: Optional[float], 2025-05-07T20:32:27.2035987Z contiguous: bool, 2025-05-07T20:32:27.2036269Z compiled: bool, 2025-05-07T20:32:27.2036530Z ) -> None: 2025-05-07T20:32:27.2036781Z torch.manual_seed(2025) 2025-05-07T20:32:27.2037059Z 2025-05-07T20:32:27.2037378Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.2039971Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:27.2042141Z 2025-05-07T20:32:27.2042284Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:27.2042528Z 2025-05-07T20:32:27.2042649Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.2043136Z self=, 2025-05-07T20:32:27.2043600Z T=2048, 2025-05-07T20:32:27.2043813Z D=7168, 2025-05-07T20:32:27.2044038Z scale_ub=1200.0, 2025-05-07T20:32:27.2044299Z contiguous=True, 2025-05-07T20:32:27.2044549Z compiled=True, 2025-05-07T20:32:27.2044789Z ) 2025-05-07T20:32:27.2045154Z self = 2025-05-07T20:32:27.2045722Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:27.2046031Z 2025-05-07T20:32:27.2046120Z @given( 2025-05-07T20:32:27.2046385Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.2046743Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.2047099Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.2047480Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.2047859Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.2048283Z ) 2025-05-07T20:32:27.2048688Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.2049202Z def test_silu_mul_quant( 2025-05-07T20:32:27.2049482Z self, 2025-05-07T20:32:27.2049701Z T: int, 2025-05-07T20:32:27.2049933Z D: int, 2025-05-07T20:32:27.2050186Z scale_ub: Optional[float], 2025-05-07T20:32:27.2050494Z contiguous: bool, 2025-05-07T20:32:27.2050772Z compiled: bool, 2025-05-07T20:32:27.2051031Z ) -> None: 2025-05-07T20:32:27.2051273Z torch.manual_seed(2025) 2025-05-07T20:32:27.2051552Z 2025-05-07T20:32:27.2051868Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.2052259Z 2025-05-07T20:32:27.2052493Z x_sign = torch.sign(x) 2025-05-07T20:32:27.2052832Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.2055132Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
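Note: the OutOfMemoryError examples are a secondary failure mode. Each Hypothesis example allocates a fresh [T, 2*D] bfloat16 tensor (for T=16384, D=7168 that is 16384 x 14336 x 2 bytes = 448 MiB, matching the failed allocation above), and since the earlier failing examples leave their intermediates cached, the A10G's 22.07 GiB fills up until even 40-56 MiB requests fail. One mitigation, sketched below under the assumption that the tests live in a unittest.TestCase subclass (the class name here is hypothetical), is to release cached CUDA blocks between examples; the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True setting suggested by the error text itself would additionally need to be exported before the process starts.

import gc
import unittest

import torch

class ActivationTests(unittest.TestCase):  # hypothetical test-class name
    def tearDown(self) -> None:
        # Drop Python references left over from a failed example, then hand
        # the cached CUDA blocks back so the next example can allocate.
        gc.collect()
        torch.cuda.empty_cache()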
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:27.2057246Z 2025-05-07T20:32:27.2057387Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:27.2057631Z 2025-05-07T20:32:27.2057758Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.2058257Z self=, 2025-05-07T20:32:27.2058949Z T=2048, 2025-05-07T20:32:27.2059164Z D=7168, 2025-05-07T20:32:27.2059388Z scale_ub=None, 2025-05-07T20:32:27.2059635Z contiguous=True, 2025-05-07T20:32:27.2059895Z compiled=False, 2025-05-07T20:32:27.2060133Z ) 2025-05-07T20:32:27.3477225Z self = 2025-05-07T20:32:27.3477848Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:27.3478279Z 2025-05-07T20:32:27.3478395Z @given( 2025-05-07T20:32:27.3478658Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.3479021Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.3479372Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.3479751Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.3480121Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.3480447Z ) 2025-05-07T20:32:27.3481065Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.3481571Z def test_silu_mul_quant( 2025-05-07T20:32:27.3481850Z self, 2025-05-07T20:32:27.3482077Z T: int, 2025-05-07T20:32:27.3482304Z D: int, 2025-05-07T20:32:27.3482556Z scale_ub: Optional[float], 2025-05-07T20:32:27.3482865Z contiguous: bool, 2025-05-07T20:32:27.3483130Z compiled: bool, 2025-05-07T20:32:27.3483386Z ) -> None: 2025-05-07T20:32:27.3483631Z torch.manual_seed(2025) 2025-05-07T20:32:27.3483899Z 2025-05-07T20:32:27.3484211Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.3484600Z 2025-05-07T20:32:27.3484820Z > x_sign = torch.sign(x) 2025-05-07T20:32:27.3487037Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:27.3489292Z 2025-05-07T20:32:27.3489427Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:27.3489674Z 2025-05-07T20:32:27.3489792Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.3490266Z self=, 2025-05-07T20:32:27.3490718Z T=1, 2025-05-07T20:32:27.3490929Z D=7168, 2025-05-07T20:32:27.3491150Z scale_ub=1200.0, 2025-05-07T20:32:27.3491398Z contiguous=True, 2025-05-07T20:32:27.3491651Z compiled=False, 2025-05-07T20:32:27.3491890Z ) 2025-05-07T20:32:27.3498325Z self = 2025-05-07T20:32:27.3498900Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:27.3499206Z 2025-05-07T20:32:27.3499297Z @given( 2025-05-07T20:32:27.3499562Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.3499928Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.3500275Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.3500646Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.3501017Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.3501339Z ) 2025-05-07T20:32:27.3501735Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.3502235Z def test_silu_mul_quant( 2025-05-07T20:32:27.3502515Z self, 2025-05-07T20:32:27.3502734Z T: int, 2025-05-07T20:32:27.3502957Z D: int, 2025-05-07T20:32:27.3503207Z scale_ub: Optional[float], 2025-05-07T20:32:27.3503515Z contiguous: bool, 2025-05-07T20:32:27.3503785Z compiled: bool, 2025-05-07T20:32:27.3504036Z ) -> None: 2025-05-07T20:32:27.3504280Z torch.manual_seed(2025) 2025-05-07T20:32:27.3504549Z 2025-05-07T20:32:27.3504861Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.3505240Z 2025-05-07T20:32:27.3505462Z x_sign = torch.sign(x) 2025-05-07T20:32:27.3505790Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.3506141Z x = x_sign * x_clamp 2025-05-07T20:32:27.3506407Z x0 = x[:, :D] 2025-05-07T20:32:27.3506652Z x1 = x[:, D:] 2025-05-07T20:32:27.3506891Z 2025-05-07T20:32:27.3507095Z if contiguous: 2025-05-07T20:32:27.3507356Z x0 = x0.contiguous() 2025-05-07T20:32:27.3507648Z x1 = x1.contiguous() 2025-05-07T20:32:27.3507912Z 2025-05-07T20:32:27.3508132Z if scale_ub is not None: 2025-05-07T20:32:27.3508438Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:27.3508929Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:27.3509290Z ) 2025-05-07T20:32:27.3509517Z else: 2025-05-07T20:32:27.3509752Z scale_ub_tensor = None 2025-05-07T20:32:27.3510038Z 2025-05-07T20:32:27.3510307Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.3510655Z op = silu_mul_quant 2025-05-07T20:32:27.3510937Z if compiled: 2025-05-07T20:32:27.3511219Z op = torch.compile(op) 2025-05-07T20:32:27.3511554Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.3511857Z 2025-05-07T20:32:27.3512078Z > y_fp8, y_scale = fn() 2025-05-07T20:32:27.3512263Z 2025-05-07T20:32:27.3512382Z moe/activation_test.py:117: 2025-05-07T20:32:27.3512712Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.3513143Z moe/activation_test.py:115: in fn 2025-05-07T20:32:27.3513676Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.3514461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:27.3515232Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:27.3515956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:27.3516725Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:27.3517466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:27.3518062Z kernel = self.compile( 2025-05-07T20:32:27.3518672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:27.3519410Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:27.3519853Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.3520115Z 2025-05-07T20:32:27.3520348Z self = 2025-05-07T20:32:27.3521553Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:27.3523095Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76762d7b50>} 2025-05-07T20:32:27.3524914Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:27.3526063Z context = 2025-05-07T20:32:27.3526387Z 2025-05-07T20:32:27.3526583Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:27.3527171Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:27.3527701Z module_map=module_map) 2025-05-07T20:32:27.3528117Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:27.3528512Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:27.3528804Z E ^ 2025-05-07T20:32:27.3529322Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:27.3529834Z 2025-05-07T20:32:27.3530300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:27.3530872Z 2025-05-07T20:32:27.3530994Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.3531460Z self=, 2025-05-07T20:32:27.3532056Z T=128, 2025-05-07T20:32:27.3532277Z D=5120, 2025-05-07T20:32:27.3532495Z scale_ub=None, 2025-05-07T20:32:27.3532741Z contiguous=True, 2025-05-07T20:32:27.3532994Z compiled=False, 2025-05-07T20:32:27.3533235Z ) 2025-05-07T20:32:27.4376408Z self = 2025-05-07T20:32:27.4377197Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:27.4377621Z 2025-05-07T20:32:27.4377744Z @given( 2025-05-07T20:32:27.4378271Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.4379058Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.4379946Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.4380586Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.4381209Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.4381744Z ) 2025-05-07T20:32:27.4382428Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.4383274Z def test_silu_mul_quant( 2025-05-07T20:32:27.4383727Z self, 2025-05-07T20:32:27.4384102Z T: int, 2025-05-07T20:32:27.4384474Z D: int, 2025-05-07T20:32:27.4385200Z scale_ub: Optional[float], 2025-05-07T20:32:27.4385717Z contiguous: bool, 2025-05-07T20:32:27.4386172Z compiled: bool, 2025-05-07T20:32:27.4386595Z ) -> None: 2025-05-07T20:32:27.4387008Z torch.manual_seed(2025) 2025-05-07T20:32:27.4387468Z 2025-05-07T20:32:27.4387976Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.4388528Z 2025-05-07T20:32:27.4388749Z x_sign = torch.sign(x) 2025-05-07T20:32:27.4389071Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.4389412Z x = x_sign * x_clamp 2025-05-07T20:32:27.4389687Z x0 = x[:, :D] 2025-05-07T20:32:27.4389928Z x1 = x[:, D:] 2025-05-07T20:32:27.4390156Z 2025-05-07T20:32:27.4390367Z if contiguous: 2025-05-07T20:32:27.4390628Z x0 = x0.contiguous() 2025-05-07T20:32:27.4390914Z x1 = x1.contiguous() 2025-05-07T20:32:27.4391182Z 2025-05-07T20:32:27.4391401Z if scale_ub is not None: 2025-05-07T20:32:27.4391702Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:27.4392076Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:27.4392424Z ) 2025-05-07T20:32:27.4392632Z else: 2025-05-07T20:32:27.4392867Z scale_ub_tensor = None 2025-05-07T20:32:27.4393147Z 2025-05-07T20:32:27.4393400Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.4393815Z op = silu_mul_quant 2025-05-07T20:32:27.4394094Z if compiled: 2025-05-07T20:32:27.4394369Z op = torch.compile(op) 2025-05-07T20:32:27.4394694Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.4394996Z 2025-05-07T20:32:27.4395216Z > y_fp8, y_scale = fn() 2025-05-07T20:32:27.4395401Z 2025-05-07T20:32:27.4395516Z moe/activation_test.py:117: 2025-05-07T20:32:27.4395847Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.4396219Z moe/activation_test.py:115: in fn 2025-05-07T20:32:27.4396528Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.4397304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:27.4398080Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:27.4398679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:27.4399433Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:27.4400171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:27.4400898Z kernel = self.compile( 2025-05-07T20:32:27.4401503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:27.4402236Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:27.4402680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.4402933Z 2025-05-07T20:32:27.4403168Z self = 2025-05-07T20:32:27.4404367Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:27.4405907Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f7676008670>} 2025-05-07T20:32:27.4407416Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:27.4408655Z context = 2025-05-07T20:32:27.4408974Z 2025-05-07T20:32:27.4409165Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:27.4409743Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:27.4410267Z module_map=module_map) 2025-05-07T20:32:27.4410675Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:27.4411066Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:27.4411355Z E ^ 2025-05-07T20:32:27.4411874Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:27.4412375Z 2025-05-07T20:32:27.4412848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:27.4413420Z 2025-05-07T20:32:27.4413536Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.4414004Z self=, 2025-05-07T20:32:27.4414451Z T=128, 2025-05-07T20:32:27.4414654Z D=7168, 2025-05-07T20:32:27.4414873Z scale_ub=None, 2025-05-07T20:32:27.4415112Z contiguous=True, 2025-05-07T20:32:27.4415356Z compiled=False, 2025-05-07T20:32:27.4415589Z ) 2025-05-07T20:32:27.4415946Z self = 2025-05-07T20:32:27.4416488Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:27.4416784Z 2025-05-07T20:32:27.4416871Z @given( 2025-05-07T20:32:27.4417127Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.4417478Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.4417821Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.4418224Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.4418606Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.4418926Z ) 2025-05-07T20:32:27.4419318Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.4419811Z def test_silu_mul_quant( 2025-05-07T20:32:27.4420182Z self, 2025-05-07T20:32:27.4420437Z T: int, 2025-05-07T20:32:27.4420658Z D: int, 2025-05-07T20:32:27.4420905Z scale_ub: Optional[float], 2025-05-07T20:32:27.4421204Z contiguous: bool, 2025-05-07T20:32:27.4421473Z compiled: bool, 2025-05-07T20:32:27.4421726Z ) -> None: 2025-05-07T20:32:27.4421960Z torch.manual_seed(2025) 2025-05-07T20:32:27.4422229Z 2025-05-07T20:32:27.4422533Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.4422909Z 2025-05-07T20:32:27.4423258Z x_sign = torch.sign(x) 2025-05-07T20:32:27.4423586Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.4424223Z x = x_sign * x_clamp 2025-05-07T20:32:27.4424495Z x0 = x[:, :D] 2025-05-07T20:32:27.4424745Z x1 = x[:, D:] 2025-05-07T20:32:27.4424981Z 2025-05-07T20:32:27.4425187Z if contiguous: 2025-05-07T20:32:27.4425454Z x0 = x0.contiguous() 2025-05-07T20:32:27.4425748Z x1 = x1.contiguous() 2025-05-07T20:32:27.4426012Z 2025-05-07T20:32:27.4426231Z if scale_ub is not None: 2025-05-07T20:32:27.4426546Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:27.4426918Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:27.4427266Z ) 2025-05-07T20:32:27.4427485Z else: 2025-05-07T20:32:27.4427716Z scale_ub_tensor = None 2025-05-07T20:32:27.4428006Z 2025-05-07T20:32:27.4428322Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.4428675Z op = silu_mul_quant 2025-05-07T20:32:27.4428958Z if compiled: 2025-05-07T20:32:27.4429238Z op = torch.compile(op) 2025-05-07T20:32:27.4429748Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.4430062Z 2025-05-07T20:32:27.4430279Z > y_fp8, y_scale = fn() 2025-05-07T20:32:27.4430464Z 2025-05-07T20:32:27.4430581Z moe/activation_test.py:117: 2025-05-07T20:32:27.4431006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.4431573Z moe/activation_test.py:115: in fn 2025-05-07T20:32:27.4431893Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.4432669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:27.4433441Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:27.4434118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:27.4434884Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:27.4435623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:27.4436229Z kernel = self.compile( 2025-05-07T20:32:27.4436840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:27.4437575Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:27.4438022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.4438287Z 2025-05-07T20:32:27.4438517Z self = 2025-05-07T20:32:27.4439740Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:27.4441275Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f7676008ee0>} 2025-05-07T20:32:27.4442850Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:27.4443994Z context = 2025-05-07T20:32:27.4444320Z 2025-05-07T20:32:27.4444510Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:27.4445093Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:27.4445617Z module_map=module_map) 2025-05-07T20:32:27.4446187Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:27.4446586Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:27.4446876Z E ^ 2025-05-07T20:32:27.4447394Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False,
)

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False,
)

    [@given/@settings decorators and signature identical to the example above]
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
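The CompilationError above is Triton refusing to lower the fp8e4nv (e4m3) dtype on this runner's GPU, which per the error only exposes fp8e4b15 and fp8e5. fp8e4nv lowering is generally assumed to need compute capability 8.9 or newer, so a guard along the following lines could skip these cases cleanly instead of erroring; the helper name and the skipIf wiring are illustrative assumptions, not the project's actual gating:

    import unittest

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (e4m3) lowering needs SM 8.9+ (Ada/Hopper);
        # older parts, like the GPU in this job, expose only fp8e4b15/fp8e5.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class ActivationTests(unittest.TestCase):
        ...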
Hypothesis kept generating examples, and each one failed on an early CUDA allocation with GPU 0 already holding 22.03 GiB of its 22.07 GiB capacity (21.73 GiB allocated by PyTorch). The repeated @given source and OOM hint printed for every example are identical to the first occurrence above; only the parameters, the failing line, and the requested size vary:

Trying example: T=2048,  D=5120, scale_ub=None,   contiguous=True,  compiled=False -> OutOfMemoryError: 40.00 MiB  (moe/activation_test.py:94, x_sign = torch.sign(x))
Trying example: T=16384, D=5120, scale_ub=None,   contiguous=True,  compiled=False -> OutOfMemoryError: 320.00 MiB (moe/activation_test.py:92, torch.randn)
Trying example: T=4096,  D=5120, scale_ub=None,   contiguous=True,  compiled=False -> OutOfMemoryError: 80.00 MiB  (moe/activation_test.py:92, torch.randn)
Trying example: T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=False -> OutOfMemoryError: 40.00 MiB  (moe/activation_test.py:92, torch.randn)
Trying example: T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=True  -> OutOfMemoryError: 112.00 MiB (moe/activation_test.py:92, torch.randn)
Trying example: T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False -> OutOfMemoryError: 40.00 MiB  (moe/activation_test.py:92, torch.randn)
Trying example: T=4096,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=False -> OutOfMemoryError: 112.00 MiB (moe/activation_test.py:92, torch.randn)
Trying example: T=16384, D=7168, scale_ub=None,   contiguous=False, compiled=True  -> OutOfMemoryError: 448.00 MiB (moe/activation_test.py:92, torch.randn)
Trying example: T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=False -> OutOfMemoryError: 112.00 MiB (moe/activation_test.py:92, torch.randn)
Trying example: T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=False -> OutOfMemoryError: 448.00 MiB (moe/activation_test.py:92, torch.randn)
Trying example: T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=False -> OutOfMemoryError: 448.00 MiB (moe/activation_test.py:92, torch.randn)
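Every OOM message carries the same allocator hint. The "reserved by PyTorch but unallocated" figure is small here (53.93 MiB at most), so fragmentation is unlikely to be the whole story, but the suggested setting is cheap to try. It has to be in the environment before the process first touches CUDA, for example:

    # In the shell that launches the tests:
    #   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest moe/activation_test.py
    #
    # Or from Python, set the variable before anything initializes CUDA:
    import os
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after the setting so the allocator is sure to pick it up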
Trying example: T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=False -> CompilationError: the allocations succeeded, but the _fbgemm_silu_mul_quant launch again failed to build with ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") [traceback identical to the one shown above]
Trying example: T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=False -> OutOfMemoryError: 56.00 MiB (moe/activation_test.py:92, torch.randn)
Trying example: T=128,   D=7168, scale_ub=1200.0, contiguous=True,  compiled=True  -> CompilationError: with compiled=True the call enters torch._dynamo first (eval_frame.py:678: return fn(*args, **kwargs)) before reaching silu_mul_quant (activation.py:80), where the Triton build fails with the same fp8e4nv ValueError
Trying example: T=128,   D=7168, scale_ub=1200.0, contiguous=True,  compiled=False -> OutOfMemoryError: 20.00 MiB (moe/activation_test.py:95, x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)); GPU 0 is now down to 8.44 MiB free
Trying example: T=128,   D=5120, scale_ub=1200.0, contiguous=True,  compiled=True  -> OutOfMemoryError: 20.00 MiB (moe/activation_test.py:94, x_sign = torch.sign(x))
Trying example: T=128,   D=7168, scale_ub=None,   contiguous=True,  compiled=True  -> OutOfMemoryError: 20.00 MiB (moe/activation_test.py:92, torch.randn)
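By this point even 20 MiB requests for the smallest shape (T=128) fail, and 21.77 GiB was already held by PyTorch before this test allocated anything, so the pressure looks inherited from earlier tests in the same process rather than caused by test_silu_mul_quant itself. One speculative mitigation (an assumption about the harness, not a confirmed fix) is to hand cached blocks back between test methods:

    import gc
    import unittest

    import torch

    class ActivationTests(unittest.TestCase):
        def tearDown(self) -> None:
            # Drop dead references, then return cached blocks to the driver.
            # Note: tearDown runs once per test method, not once per Hypothesis
            # example, so tensors created inside an example still have to die
            # by going out of scope before this helps the next test.
            gc.collect()
            if torch.cuda.is_available():
                torch.cuda.empty_cache()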
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:28.5633331Z 2025-05-07T20:32:28.5633475Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:28.5633805Z 2025-05-07T20:32:28.5694054Z FAILED 2025-05-07T20:32:28.5694203Z 2025-05-07T20:32:28.5694407Z =================================== FAILURES =================================== 2025-05-07T20:32:28.5695012Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:28.5695562Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:28.5696406Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:28.5697115Z | yield 2025-05-07T20:32:28.5697895Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run 2025-05-07T20:32:28.5698648Z | self._callTestMethod(testMethod) 2025-05-07T20:32:28.5699413Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod 2025-05-07T20:32:28.5700248Z | method() 2025-05-07T20:32:28.5701239Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:28.5702393Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.5703406Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:28.5704384Z | raise the_error_hypothesis_found 2025-05-07T20:32:28.5705144Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:28.5705917Z +-+---------------- 1 ---------------- 2025-05-07T20:32:28.5706374Z | Traceback (most recent call last): 2025-05-07T20:32:28.5707489Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:28.5708706Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.5711871Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:28.5715024Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:28.5715712Z | self=, 2025-05-07T20:32:28.5716366Z | T=128, 2025-05-07T20:32:28.5716689Z | D=7168, 2025-05-07T20:32:28.5717039Z | scale_ub=1200.0, 2025-05-07T20:32:28.5717416Z | contiguous=True, 2025-05-07T20:32:28.5717803Z | compiled=False, 2025-05-07T20:32:28.5718170Z | ) 2025-05-07T20:32:28.5718381Z | 2025-05-07T20:32:28.5718985Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:32:28.5719674Z +---------------- 2 ---------------- 2025-05-07T20:32:28.5720159Z | Traceback (most recent call last): 2025-05-07T20:32:28.5720976Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:28.5721860Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.5724462Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:28.5726686Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:28.5727193Z | self=, 2025-05-07T20:32:28.5727664Z | T=128, 2025-05-07T20:32:28.5727901Z | D=7168, 2025-05-07T20:32:28.5728149Z | scale_ub=None, 2025-05-07T20:32:28.5728589Z | contiguous=True, 2025-05-07T20:32:28.5728874Z | compiled=True, 2025-05-07T20:32:28.5729136Z | ) 2025-05-07T20:32:28.5729340Z | 2025-05-07T20:32:28.5729943Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:28.5730634Z +---------------- 3 ---------------- 2025-05-07T20:32:28.5730963Z | Traceback (most recent call last): 2025-05-07T20:32:28.5731768Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:28.5732651Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.5735335Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:28.5737547Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:28.5738045Z | self=, 2025-05-07T20:32:28.5738508Z | T=128, 2025-05-07T20:32:28.5738743Z | D=5120, 2025-05-07T20:32:28.5738984Z | scale_ub=1200.0, 2025-05-07T20:32:28.5739266Z | contiguous=True, 2025-05-07T20:32:28.5739619Z | compiled=True, 2025-05-07T20:32:28.5739881Z | ) 2025-05-07T20:32:28.5740105Z | 2025-05-07T20:32:28.5740844Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:28.5741731Z +---------------- 4 ---------------- 2025-05-07T20:32:28.5742129Z | Traceback (most recent call last): 2025-05-07T20:32:28.5743272Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:28.5744387Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:28.5745403Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:28.5746478Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.5747966Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:28.5749214Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:28.5750183Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:28.5751331Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.5767675Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:28.5768991Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.5770253Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:28.5771512Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.5772743Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:28.5773975Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:28.5774988Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:28.5775877Z | fn() 2025-05-07T20:32:28.5776766Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:28.5777692Z | self.fn.run( 2025-05-07T20:32:28.5778291Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:28.5778955Z | kernel = self.compile( 2025-05-07T20:32:28.5779657Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:28.5780455Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.5781254Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:28.5782265Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.5782859Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.5783355Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:28.5783797Z | ^ 2025-05-07T20:32:28.5784573Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.5785528Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:28.5786182Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:28.5786900Z | self=, 2025-05-07T20:32:28.5787595Z | T=1, # or any other generated value 2025-05-07T20:32:28.5788093Z | D=5120, # or any other generated value 2025-05-07T20:32:28.5788642Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:28.5789214Z | contiguous=True, # or any other generated value 2025-05-07T20:32:28.5789803Z | compiled=True, # or any other generated value 2025-05-07T20:32:28.5790287Z | ) 2025-05-07T20:32:28.5790587Z | 2025-05-07T20:32:28.5791434Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:28.5792391Z +------------------------------------ 2025-05-07T20:32:28.5792971Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:28.5793719Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.5794558Z self=, 2025-05-07T20:32:28.5795207Z T=1, 2025-05-07T20:32:28.5795502Z D=5120, 2025-05-07T20:32:28.5795820Z scale_ub=None, 2025-05-07T20:32:28.5796180Z contiguous=True, 2025-05-07T20:32:28.5796535Z compiled=True, 2025-05-07T20:32:28.5796876Z ) 2025-05-07T20:32:28.5797390Z self = 2025-05-07T20:32:28.5798151Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:28.5798559Z 2025-05-07T20:32:28.5798691Z @given( 2025-05-07T20:32:28.5799071Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.5799544Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.5800005Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.5800494Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.5800965Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.5801406Z ) 2025-05-07T20:32:28.5801958Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.5802658Z def test_silu_mul_quant( 2025-05-07T20:32:28.5803062Z self, 2025-05-07T20:32:28.5803495Z T: int, 2025-05-07T20:32:28.5803831Z D: int, 2025-05-07T20:32:28.5804198Z scale_ub: Optional[float], 2025-05-07T20:32:28.5804629Z contiguous: bool, 2025-05-07T20:32:28.5804994Z compiled: bool, 2025-05-07T20:32:28.5805337Z ) -> None: 2025-05-07T20:32:28.5805659Z torch.manual_seed(2025) 2025-05-07T20:32:28.5806014Z 2025-05-07T20:32:28.5806413Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.5806924Z 2025-05-07T20:32:28.5807224Z x_sign = torch.sign(x) 2025-05-07T20:32:28.5807665Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.5808123Z x = x_sign * x_clamp 2025-05-07T20:32:28.5808491Z x0 = x[:, :D] 2025-05-07T20:32:28.5808836Z x1 = x[:, D:] 2025-05-07T20:32:28.5809154Z 2025-05-07T20:32:28.5809447Z if contiguous: 2025-05-07T20:32:28.5809806Z x0 = x0.contiguous() 
2025-05-07T20:32:28.5810203Z x1 = x1.contiguous() 2025-05-07T20:32:28.5810569Z 2025-05-07T20:32:28.5810870Z if scale_ub is not None: 2025-05-07T20:32:28.5811284Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.5811776Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.5812235Z ) 2025-05-07T20:32:28.5812528Z else: 2025-05-07T20:32:28.5812844Z scale_ub_tensor = None 2025-05-07T20:32:28.5813229Z 2025-05-07T20:32:28.5813581Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.5814045Z op = silu_mul_quant 2025-05-07T20:32:28.5814423Z if compiled: 2025-05-07T20:32:28.5814807Z op = torch.compile(op) 2025-05-07T20:32:28.5815252Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.5815668Z 2025-05-07T20:32:28.5815972Z y_fp8, y_scale = fn() 2025-05-07T20:32:28.5816427Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:28.5816906Z 2025-05-07T20:32:28.5817293Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.5817832Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:28.5818299Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:28.5818805Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:28.5819378Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.5819872Z 2025-05-07T20:32:28.5820204Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:28.5820519Z 2025-05-07T20:32:28.5820687Z moe/activation_test.py:126: 2025-05-07T20:32:28.5821160Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.5821697Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:28.5822340Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.5823582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:28.5825144Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:28.5826021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.5827093Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.5828146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:28.5829360Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.5830551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:28.5831757Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.5832915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:28.5834393Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:28.5835331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:28.5836160Z fn() 2025-05-07T20:32:28.5836973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:28.5837903Z self.fn.run( 2025-05-07T20:32:28.5838645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.5839468Z kernel = self.compile( 2025-05-07T20:32:28.5840324Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.5841351Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.5841949Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.5842283Z 2025-05-07T20:32:28.5842572Z self = 2025-05-07T20:32:28.5844112Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.5846158Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76a4c60550>} 2025-05-07T20:32:28.5848252Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.5849874Z context = 2025-05-07T20:32:28.5850332Z 2025-05-07T20:32:28.5850601Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.5851447Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.5852197Z module_map=module_map) 2025-05-07T20:32:28.5852777Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.5853354Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:28.5853790Z E ^ 2025-05-07T20:32:28.5854535Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.5855253Z 2025-05-07T20:32:28.5855909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.5856896Z 2025-05-07T20:32:28.5857067Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.5857717Z self=, 2025-05-07T20:32:28.5858338Z T=2048, 2025-05-07T20:32:28.5858648Z D=5120, 2025-05-07T20:32:28.5858957Z scale_ub=1200.0, 2025-05-07T20:32:28.5859311Z contiguous=True, 2025-05-07T20:32:28.5859674Z compiled=False, 2025-05-07T20:32:28.5860008Z ) 2025-05-07T20:32:28.5860513Z self = 2025-05-07T20:32:28.5861281Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:28.5861715Z 2025-05-07T20:32:28.5861840Z @given( 2025-05-07T20:32:28.5862210Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.5862698Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.5863182Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.5863699Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.5864188Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.5864627Z ) 2025-05-07T20:32:28.5865193Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.5865961Z def test_silu_mul_quant( 2025-05-07T20:32:28.5866350Z self, 2025-05-07T20:32:28.5866681Z T: int, 2025-05-07T20:32:28.5867008Z D: int, 2025-05-07T20:32:28.5867319Z scale_ub: Optional[float], 2025-05-07T20:32:28.5867751Z contiguous: bool, 2025-05-07T20:32:28.5868148Z compiled: bool, 2025-05-07T20:32:28.5868526Z ) -> None: 2025-05-07T20:32:28.5868886Z torch.manual_seed(2025) 2025-05-07T20:32:28.5869239Z 2025-05-07T20:32:28.5869670Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.5870221Z 2025-05-07T20:32:28.5870541Z x_sign = torch.sign(x) 2025-05-07T20:32:28.5870998Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.5871493Z x = x_sign * x_clamp 2025-05-07T20:32:28.5871887Z x0 = x[:, :D] 
2025-05-07T20:32:28.5872197Z x1 = x[:, D:] 2025-05-07T20:32:28.5872507Z 2025-05-07T20:32:28.5872814Z if contiguous: 2025-05-07T20:32:28.5873187Z x0 = x0.contiguous() 2025-05-07T20:32:28.5873728Z x1 = x1.contiguous() 2025-05-07T20:32:28.5874080Z 2025-05-07T20:32:28.5874358Z if scale_ub is not None: 2025-05-07T20:32:28.5874738Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.5875203Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.5875641Z ) 2025-05-07T20:32:28.5875915Z else: 2025-05-07T20:32:28.5876279Z scale_ub_tensor = None 2025-05-07T20:32:28.5876671Z 2025-05-07T20:32:28.5876999Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.5877451Z op = silu_mul_quant 2025-05-07T20:32:28.5877813Z if compiled: 2025-05-07T20:32:28.5878175Z op = torch.compile(op) 2025-05-07T20:32:28.5878677Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.5879105Z 2025-05-07T20:32:28.5879378Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.5879635Z 2025-05-07T20:32:28.5879780Z moe/activation_test.py:117: 2025-05-07T20:32:28.5880203Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.5880684Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.5881092Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.5882190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.5883298Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.5884133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.5885895Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.5886837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.5887604Z kernel = self.compile( 2025-05-07T20:32:28.5888379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.5889345Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.5889930Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.5890258Z 2025-05-07T20:32:28.5890576Z self = 2025-05-07T20:32:28.5892112Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.5894093Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76a53d89d0>} 2025-05-07T20:32:28.5896039Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.5897695Z context = 2025-05-07T20:32:28.5898135Z 2025-05-07T20:32:28.5898372Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.5899175Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.5899901Z module_map=module_map) 2025-05-07T20:32:28.5900464Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.5901002Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.5901419Z E ^ 2025-05-07T20:32:28.5902139Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.5902825Z 2025-05-07T20:32:28.5903456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.5904254Z 2025-05-07T20:32:28.5904418Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.5905065Z self=, 2025-05-07T20:32:28.5905698Z T=2048, 2025-05-07T20:32:28.5906001Z D=5120, 2025-05-07T20:32:28.5906316Z scale_ub=1200.0, 2025-05-07T20:32:28.5906677Z contiguous=True, 2025-05-07T20:32:28.5907032Z compiled=True, 2025-05-07T20:32:28.5907373Z ) 2025-05-07T20:32:28.5907882Z self = 2025-05-07T20:32:28.5908646Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:28.5909072Z 2025-05-07T20:32:28.5909199Z @given( 2025-05-07T20:32:28.5909569Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.5910055Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.5910556Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.5911090Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.5911629Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.5912098Z ) 2025-05-07T20:32:28.5912665Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.5913374Z def test_silu_mul_quant( 2025-05-07T20:32:28.5913891Z self, 2025-05-07T20:32:28.5914216Z T: int, 2025-05-07T20:32:28.5914546Z D: int, 2025-05-07T20:32:28.5914899Z scale_ub: Optional[float], 2025-05-07T20:32:28.5915336Z contiguous: bool, 2025-05-07T20:32:28.5915707Z compiled: bool, 2025-05-07T20:32:28.5916051Z ) -> None: 2025-05-07T20:32:28.5916508Z torch.manual_seed(2025) 2025-05-07T20:32:28.5916905Z 2025-05-07T20:32:28.5917334Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.5917891Z 2025-05-07T20:32:28.5918219Z x_sign = torch.sign(x) 2025-05-07T20:32:28.5918689Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.5919169Z x = x_sign * x_clamp 2025-05-07T20:32:28.5919557Z x0 = x[:, :D] 2025-05-07T20:32:28.5919905Z x1 = x[:, D:] 2025-05-07T20:32:28.5920230Z 2025-05-07T20:32:28.5920525Z if contiguous: 2025-05-07T20:32:28.5920889Z x0 = x0.contiguous() 2025-05-07T20:32:28.5921294Z x1 = x1.contiguous() 2025-05-07T20:32:28.5921686Z 2025-05-07T20:32:28.5922001Z if scale_ub is not None: 2025-05-07T20:32:28.5922440Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.5922974Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.5923483Z ) 2025-05-07T20:32:28.5924080Z else: 2025-05-07T20:32:28.5924440Z scale_ub_tensor = None 2025-05-07T20:32:28.5924853Z 2025-05-07T20:32:28.5925219Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.5925943Z op = silu_mul_quant 2025-05-07T20:32:28.5926354Z if compiled: 2025-05-07T20:32:28.5926761Z op = torch.compile(op) 2025-05-07T20:32:28.5927238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.5927686Z 2025-05-07T20:32:28.5928006Z y_fp8, y_scale = fn() 2025-05-07T20:32:28.5928465Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:28.5928936Z 2025-05-07T20:32:28.5929315Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.5929835Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:28.5930304Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:28.5930810Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:28.5931391Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.5931886Z 2025-05-07T20:32:28.5932217Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:28.5932514Z 2025-05-07T20:32:28.5932671Z moe/activation_test.py:126: 2025-05-07T20:32:28.5933108Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.5933625Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:28.5934145Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.5935348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:28.5936515Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:28.5937377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.5938452Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.5939468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:28.5940530Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.5941665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:28.5942817Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.5943923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:28.5944898Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:28.5945813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:28.5946603Z fn() 2025-05-07T20:32:28.5947557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:28.5948465Z self.fn.run( 2025-05-07T20:32:28.5949192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.5950028Z kernel = self.compile( 2025-05-07T20:32:28.5950847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.5951823Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.5952425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.5952779Z 2025-05-07T20:32:28.5953085Z self = 2025-05-07T20:32:28.5954809Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.5956886Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f76a536f1c0>} 2025-05-07T20:32:28.5959043Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.5960534Z context = 2025-05-07T20:32:28.5960960Z 2025-05-07T20:32:28.5961210Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.5962013Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.5962686Z module_map=module_map) 2025-05-07T20:32:28.5963196Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.5963719Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:28.5964112Z E ^ 2025-05-07T20:32:28.5964763Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.5965417Z 2025-05-07T20:32:28.5966014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.5966748Z 2025-05-07T20:32:28.5966897Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.5967487Z self=, 2025-05-07T20:32:28.5968064Z T=16384, 2025-05-07T20:32:28.5968346Z D=7168, 2025-05-07T20:32:28.5968629Z scale_ub=1200.0, 2025-05-07T20:32:28.5968942Z contiguous=False, 2025-05-07T20:32:28.5969286Z compiled=False, 2025-05-07T20:32:28.5969609Z ) 2025-05-07T20:32:28.5970095Z self = 2025-05-07T20:32:28.5970861Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:28.5971299Z 2025-05-07T20:32:28.5971431Z @given( 2025-05-07T20:32:28.5971761Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.5972229Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.5972667Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.5973156Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.5973643Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.5974081Z ) 2025-05-07T20:32:28.5974595Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.5975298Z def test_silu_mul_quant( 2025-05-07T20:32:28.5975701Z self, 2025-05-07T20:32:28.5976028Z T: int, 2025-05-07T20:32:28.5976357Z D: int, 2025-05-07T20:32:28.5976723Z scale_ub: Optional[float], 2025-05-07T20:32:28.5977274Z contiguous: bool, 2025-05-07T20:32:28.5977621Z compiled: bool, 2025-05-07T20:32:28.5977965Z ) -> None: 2025-05-07T20:32:28.5978291Z torch.manual_seed(2025) 2025-05-07T20:32:28.5978631Z 2025-05-07T20:32:28.5979030Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.5979567Z 2025-05-07T20:32:28.5979848Z x_sign = torch.sign(x) 2025-05-07T20:32:28.5980284Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.5980763Z x = x_sign * x_clamp 2025-05-07T20:32:28.5981141Z x0 = x[:, :D] 2025-05-07T20:32:28.5981478Z x1 = x[:, D:] 2025-05-07T20:32:28.5981819Z 2025-05-07T20:32:28.5982113Z if contiguous: 2025-05-07T20:32:28.5982493Z x0 = x0.contiguous() 2025-05-07T20:32:28.5982917Z x1 = x1.contiguous() 2025-05-07T20:32:28.5983309Z 2025-05-07T20:32:28.5983614Z if scale_ub is not None: 2025-05-07T20:32:28.5984061Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.5984596Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.5985062Z ) 2025-05-07T20:32:28.5985373Z else: 2025-05-07T20:32:28.5985706Z scale_ub_tensor = None 2025-05-07T20:32:28.5986250Z 2025-05-07T20:32:28.5986606Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.5987084Z op = silu_mul_quant 2025-05-07T20:32:28.5987483Z if compiled: 
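# [editor's note, not part of the original log] torch.compile only builds a
# lazily-compiled wrapper at this line; no Triton or Inductor work happens
# until the wrapper's first call inside fn(). That is why compiled=True and
# compiled=False examples surface the same Triton CompilationError: the
# failure occurs during kernel compilation at call time, not here.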
2025-05-07T20:32:28.5987880Z op = torch.compile(op) 2025-05-07T20:32:28.5988360Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.5988844Z 2025-05-07T20:32:28.5989158Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.5989435Z 2025-05-07T20:32:28.5989598Z moe/activation_test.py:117: 2025-05-07T20:32:28.6001980Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6002503Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6002951Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6003990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6005020Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6005854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6006929Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6007986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6008838Z kernel = self.compile( 2025-05-07T20:32:28.6009696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6010669Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6011276Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6011628Z 2025-05-07T20:32:28.6011942Z self = 2025-05-07T20:32:28.6013562Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6015625Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76a4cb7ac0>} 2025-05-07T20:32:28.6017667Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6019225Z context = 2025-05-07T20:32:28.6019662Z 2025-05-07T20:32:28.6020081Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6020903Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6021650Z module_map=module_map) 2025-05-07T20:32:28.6022222Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6022779Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6023182Z E ^ 2025-05-07T20:32:28.6024209Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6024932Z 2025-05-07T20:32:28.6025607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6026402Z 2025-05-07T20:32:28.6026584Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6027265Z self=, 2025-05-07T20:32:28.6027910Z T=1, 2025-05-07T20:32:28.6028219Z D=7168, 2025-05-07T20:32:28.6028528Z scale_ub=None, 2025-05-07T20:32:28.6028889Z contiguous=True, 2025-05-07T20:32:28.6029263Z compiled=True, 2025-05-07T20:32:28.6029797Z ) 2025-05-07T20:32:28.6030300Z self = 2025-05-07T20:32:28.6031055Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:28.6031459Z 2025-05-07T20:32:28.6031588Z @given( 2025-05-07T20:32:28.6031969Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6032462Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6032953Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6033478Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6034126Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6034571Z ) 2025-05-07T20:32:28.6035112Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6035794Z def test_silu_mul_quant( 2025-05-07T20:32:28.6036173Z self, 2025-05-07T20:32:28.6036480Z T: int, 2025-05-07T20:32:28.6036799Z D: int, 2025-05-07T20:32:28.6037161Z scale_ub: Optional[float], 2025-05-07T20:32:28.6037598Z contiguous: bool, 2025-05-07T20:32:28.6037969Z compiled: bool, 2025-05-07T20:32:28.6038348Z ) -> None: 2025-05-07T20:32:28.6038702Z torch.manual_seed(2025) 2025-05-07T20:32:28.6039092Z 2025-05-07T20:32:28.6039532Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6040087Z 2025-05-07T20:32:28.6040401Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6040878Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6041381Z x = x_sign * x_clamp 2025-05-07T20:32:28.6041749Z x0 = x[:, :D] 2025-05-07T20:32:28.6042104Z x1 = x[:, D:] 2025-05-07T20:32:28.6042439Z 2025-05-07T20:32:28.6042740Z if contiguous: 2025-05-07T20:32:28.6043119Z x0 = x0.contiguous() 2025-05-07T20:32:28.6043543Z x1 = x1.contiguous() 2025-05-07T20:32:28.6043931Z 2025-05-07T20:32:28.6044259Z if scale_ub is not None: 2025-05-07T20:32:28.6044704Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6045237Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6045722Z ) 2025-05-07T20:32:28.6046044Z else: 2025-05-07T20:32:28.6046383Z scale_ub_tensor = None 2025-05-07T20:32:28.6046774Z 2025-05-07T20:32:28.6047138Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6047631Z op = silu_mul_quant 2025-05-07T20:32:28.6048020Z if compiled: 2025-05-07T20:32:28.6048413Z op = torch.compile(op) 2025-05-07T20:32:28.6048874Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6049293Z 2025-05-07T20:32:28.6049779Z y_fp8, y_scale = fn() 2025-05-07T20:32:28.6050193Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:28.6050603Z 2025-05-07T20:32:28.6050956Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6051494Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:28.6051967Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:28.6052455Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:28.6053025Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.6053533Z 2025-05-07T20:32:28.6053861Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:28.6054181Z 2025-05-07T20:32:28.6054346Z moe/activation_test.py:126: 2025-05-07T20:32:28.6054840Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6055313Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:28.6055784Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.6056902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:28.6057961Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:28.6058905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6059876Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6060850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:28.6061871Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.6062931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:28.6064001Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.6065035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:28.6066012Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:28.6067014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:28.6067885Z fn() 2025-05-07T20:32:28.6068763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:28.6069681Z self.fn.run( 2025-05-07T20:32:28.6070436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6071275Z kernel = self.compile( 2025-05-07T20:32:28.6072112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6073150Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6073875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6074226Z 2025-05-07T20:32:28.6074545Z self = 2025-05-07T20:32:28.6076232Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6078431Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f76943d30a0>} 2025-05-07T20:32:28.6080390Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6081680Z context = 2025-05-07T20:32:28.6082013Z 2025-05-07T20:32:28.6082222Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6082816Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6083357Z module_map=module_map) 2025-05-07T20:32:28.6083780Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6084183Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:28.6084493Z E ^ 2025-05-07T20:32:28.6085030Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6085549Z 2025-05-07T20:32:28.6086033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6086617Z 2025-05-07T20:32:28.6086738Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6087225Z self=, 2025-05-07T20:32:28.6087686Z T=4096, 2025-05-07T20:32:28.6087902Z D=5120, 2025-05-07T20:32:28.6088129Z scale_ub=None, 2025-05-07T20:32:28.6088474Z contiguous=False, 2025-05-07T20:32:28.6088740Z compiled=False, 2025-05-07T20:32:28.6088972Z ) 2025-05-07T20:32:28.6089340Z self = 2025-05-07T20:32:28.6089908Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:28.6090219Z 2025-05-07T20:32:28.6090313Z @given( 2025-05-07T20:32:28.6090583Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6090948Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6091298Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6091679Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6092066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6092392Z ) 2025-05-07T20:32:28.6092803Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6093314Z def test_silu_mul_quant( 2025-05-07T20:32:28.6093608Z self, 2025-05-07T20:32:28.6093834Z T: int, 2025-05-07T20:32:28.6094070Z D: int, 2025-05-07T20:32:28.6094326Z scale_ub: Optional[float], 2025-05-07T20:32:28.6094634Z contiguous: bool, 2025-05-07T20:32:28.6094913Z compiled: bool, 2025-05-07T20:32:28.6095174Z ) -> None: 2025-05-07T20:32:28.6095422Z torch.manual_seed(2025) 2025-05-07T20:32:28.6095716Z 2025-05-07T20:32:28.6096033Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6096421Z 2025-05-07T20:32:28.6096651Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6096993Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6097344Z x = x_sign * x_clamp 2025-05-07T20:32:28.6097627Z x0 = x[:, :D] 2025-05-07T20:32:28.6097879Z x1 = x[:, D:] 2025-05-07T20:32:28.6098116Z 2025-05-07T20:32:28.6098361Z if contiguous: 2025-05-07T20:32:28.6098658Z x0 = x0.contiguous() 2025-05-07T20:32:28.6098973Z x1 = x1.contiguous() 2025-05-07T20:32:28.6099248Z 2025-05-07T20:32:28.6099475Z if scale_ub is not None: 2025-05-07T20:32:28.6099792Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6100170Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6100526Z ) 2025-05-07T20:32:28.6100751Z else: 2025-05-07T20:32:28.6100991Z scale_ub_tensor = None 2025-05-07T20:32:28.6101284Z 2025-05-07T20:32:28.6101554Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6101910Z op = silu_mul_quant 2025-05-07T20:32:28.6102201Z if compiled: 
2025-05-07T20:32:28.6102492Z op = torch.compile(op) 2025-05-07T20:32:28.6102921Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6103245Z 2025-05-07T20:32:28.6103472Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6103660Z 2025-05-07T20:32:28.6103778Z moe/activation_test.py:117: 2025-05-07T20:32:28.6104132Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6104534Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6104866Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6105649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6106437Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6107063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6107841Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6108650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6109307Z kernel = self.compile( 2025-05-07T20:32:28.6109934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6110803Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6111263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6111524Z 2025-05-07T20:32:28.6111769Z self = 2025-05-07T20:32:28.6113013Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6114669Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76a7370c10>} 2025-05-07T20:32:28.6116209Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6117389Z context = 2025-05-07T20:32:28.6117720Z 2025-05-07T20:32:28.6117921Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6118515Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6119058Z module_map=module_map) 2025-05-07T20:32:28.6119479Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6119888Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6120184Z E ^ 2025-05-07T20:32:28.6120724Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6121241Z 2025-05-07T20:32:28.6121723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6122317Z 2025-05-07T20:32:28.6122445Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6122917Z self=, 2025-05-07T20:32:28.6123378Z T=4096, 2025-05-07T20:32:28.6123601Z D=7168, 2025-05-07T20:32:28.6124160Z scale_ub=None, 2025-05-07T20:32:28.6124427Z contiguous=False, 2025-05-07T20:32:28.6124698Z compiled=False, 2025-05-07T20:32:28.6124929Z ) 2025-05-07T20:32:28.6125299Z self = 2025-05-07T20:32:28.6125873Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:28.6126188Z 2025-05-07T20:32:28.6126280Z @given( 2025-05-07T20:32:28.6126714Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6127083Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6127443Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6127829Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6128212Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6128553Z ) 2025-05-07T20:32:28.6128974Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6129480Z def test_silu_mul_quant( 2025-05-07T20:32:28.6129763Z self, 2025-05-07T20:32:28.6129986Z T: int, 2025-05-07T20:32:28.6130225Z D: int, 2025-05-07T20:32:28.6130480Z scale_ub: Optional[float], 2025-05-07T20:32:28.6130794Z contiguous: bool, 2025-05-07T20:32:28.6131077Z compiled: bool, 2025-05-07T20:32:28.6131337Z ) -> None: 2025-05-07T20:32:28.6131587Z torch.manual_seed(2025) 2025-05-07T20:32:28.6131873Z 2025-05-07T20:32:28.6132195Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6132583Z 2025-05-07T20:32:28.6132814Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6133153Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6133650Z x = x_sign * x_clamp 2025-05-07T20:32:28.6133926Z x0 = x[:, :D] 2025-05-07T20:32:28.6134183Z x1 = x[:, D:] 2025-05-07T20:32:28.6134436Z 2025-05-07T20:32:28.6134653Z if contiguous: 2025-05-07T20:32:28.6134927Z x0 = x0.contiguous() 2025-05-07T20:32:28.6135230Z x1 = x1.contiguous() 2025-05-07T20:32:28.6135506Z 2025-05-07T20:32:28.6135734Z if scale_ub is not None: 2025-05-07T20:32:28.6136051Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6136434Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6136792Z ) 2025-05-07T20:32:28.6137020Z else: 2025-05-07T20:32:28.6137267Z scale_ub_tensor = None 2025-05-07T20:32:28.6137559Z 2025-05-07T20:32:28.6137830Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6138191Z op = silu_mul_quant 2025-05-07T20:32:28.6138494Z if compiled: 2025-05-07T20:32:28.6138784Z op = torch.compile(op) 2025-05-07T20:32:28.6139129Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6139441Z 2025-05-07T20:32:28.6139670Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6139859Z 2025-05-07T20:32:28.6139980Z moe/activation_test.py:117: 2025-05-07T20:32:28.6140318Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6140703Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6141031Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6141816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6142608Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6143226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6144007Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6144768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6145380Z kernel = self.compile( 2025-05-07T20:32:28.6146004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6146757Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6147211Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6147480Z 2025-05-07T20:32:28.6147718Z self = 2025-05-07T20:32:28.6149046Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6150607Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76a5ea7520>} 2025-05-07T20:32:28.6152123Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6153286Z context = 2025-05-07T20:32:28.6153700Z 2025-05-07T20:32:28.6153897Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6154502Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6155042Z module_map=module_map) 2025-05-07T20:32:28.6155462Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6155872Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6156259Z E ^ 2025-05-07T20:32:28.6156792Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6157314Z 2025-05-07T20:32:28.6157785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6158393Z 2025-05-07T20:32:28.6158527Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6159050Z self=, 2025-05-07T20:32:28.6159519Z T=128, 2025-05-07T20:32:28.6159743Z D=7168, 2025-05-07T20:32:28.6159966Z scale_ub=None, 2025-05-07T20:32:28.6160222Z contiguous=False, 2025-05-07T20:32:28.6160490Z compiled=True, 2025-05-07T20:32:28.6160729Z ) 2025-05-07T20:32:28.6161098Z self = 2025-05-07T20:32:28.6161661Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:28.6161974Z 2025-05-07T20:32:28.6162071Z @given( 2025-05-07T20:32:28.6162339Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6162702Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6163060Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6163439Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6163824Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6164157Z ) 2025-05-07T20:32:28.6164558Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6165068Z def test_silu_mul_quant( 2025-05-07T20:32:28.6165352Z self, 2025-05-07T20:32:28.6165581Z T: int, 2025-05-07T20:32:28.6165814Z D: int, 2025-05-07T20:32:28.6166075Z scale_ub: Optional[float], 2025-05-07T20:32:28.6166391Z contiguous: bool, 2025-05-07T20:32:28.6166671Z compiled: bool, 2025-05-07T20:32:28.6166934Z ) -> None: 2025-05-07T20:32:28.6167195Z torch.manual_seed(2025) 2025-05-07T20:32:28.6167472Z 2025-05-07T20:32:28.6167793Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6168186Z 2025-05-07T20:32:28.6168409Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6168749Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6169109Z x = x_sign * x_clamp 2025-05-07T20:32:28.6169387Z x0 = x[:, :D] 2025-05-07T20:32:28.6169643Z x1 = x[:, D:] 2025-05-07T20:32:28.6169889Z 2025-05-07T20:32:28.6170107Z if contiguous: 2025-05-07T20:32:28.6170382Z x0 = x0.contiguous() 2025-05-07T20:32:28.6170686Z x1 = x1.contiguous() 2025-05-07T20:32:28.6170966Z 2025-05-07T20:32:28.6171288Z if scale_ub is not None: 2025-05-07T20:32:28.6171615Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6172010Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6172372Z ) 2025-05-07T20:32:28.6172608Z else: 2025-05-07T20:32:28.6172862Z scale_ub_tensor = None 2025-05-07T20:32:28.6173152Z 2025-05-07T20:32:28.6173429Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6173795Z op = silu_mul_quant 2025-05-07T20:32:28.6174087Z if compiled: 2025-05-07T20:32:28.6174380Z op = torch.compile(op) 2025-05-07T20:32:28.6174732Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6175053Z 2025-05-07T20:32:28.6175286Z y_fp8, y_scale = fn() 2025-05-07T20:32:28.6175630Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:28.6175966Z 2025-05-07T20:32:28.6176255Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6176645Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:28.6176985Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:28.6177346Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:28.6177854Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.6178214Z 2025-05-07T20:32:28.6178448Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:28.6178682Z 2025-05-07T20:32:28.6178799Z moe/activation_test.py:126: 2025-05-07T20:32:28.6179149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6179536Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:28.6179920Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.6180825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:28.6181701Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:28.6182326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6183114Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6183916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:28.6184748Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.6185610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:28.6186471Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.6187310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:28.6188049Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:28.6188742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:28.6189344Z fn() 2025-05-07T20:32:28.6189945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:28.6190610Z self.fn.run( 2025-05-07T20:32:28.6191151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6191765Z kernel = self.compile( 2025-05-07T20:32:28.6192386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6193143Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6193675Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6193939Z 2025-05-07T20:32:28.6194305Z self = 2025-05-07T20:32:28.6195538Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6197110Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f76a5ea7370>} 2025-05-07T20:32:28.6198640Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6199810Z context = 2025-05-07T20:32:28.6200141Z 2025-05-07T20:32:28.6200340Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6200941Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6208496Z module_map=module_map) 2025-05-07T20:32:28.6208941Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6209477Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:28.6209794Z E ^ 2025-05-07T20:32:28.6210335Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6210856Z 2025-05-07T20:32:28.6211342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6211941Z 2025-05-07T20:32:28.6212065Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6212546Z self=, 2025-05-07T20:32:28.6213009Z T=128, 2025-05-07T20:32:28.6213226Z D=7168, 2025-05-07T20:32:28.6213463Z scale_ub=None, 2025-05-07T20:32:28.6213722Z contiguous=False, 2025-05-07T20:32:28.6213985Z compiled=False, 2025-05-07T20:32:28.6214232Z ) 2025-05-07T20:32:28.6214604Z self = 2025-05-07T20:32:28.6215174Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:28.6215492Z 2025-05-07T20:32:28.6215587Z @given( 2025-05-07T20:32:28.6215863Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6216232Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6216590Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6216979Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6217368Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6217700Z ) 2025-05-07T20:32:28.6218110Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6218631Z def test_silu_mul_quant( 2025-05-07T20:32:28.6218921Z self, 2025-05-07T20:32:28.6219154Z T: int, 2025-05-07T20:32:28.6219393Z D: int, 2025-05-07T20:32:28.6219649Z scale_ub: Optional[float], 2025-05-07T20:32:28.6219978Z contiguous: bool, 2025-05-07T20:32:28.6220265Z compiled: bool, 2025-05-07T20:32:28.6220525Z ) -> None: 2025-05-07T20:32:28.6220781Z torch.manual_seed(2025) 2025-05-07T20:32:28.6221066Z 2025-05-07T20:32:28.6221381Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6221777Z 2025-05-07T20:32:28.6222008Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6222342Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6222704Z x = x_sign * x_clamp 2025-05-07T20:32:28.6222991Z x0 = x[:, :D] 2025-05-07T20:32:28.6223241Z x1 = x[:, D:] 2025-05-07T20:32:28.6223483Z 2025-05-07T20:32:28.6223701Z if contiguous: 2025-05-07T20:32:28.6224427Z x0 = x0.contiguous() 2025-05-07T20:32:28.6224738Z x1 = x1.contiguous() 2025-05-07T20:32:28.6225018Z 2025-05-07T20:32:28.6225244Z if scale_ub is not None: 2025-05-07T20:32:28.6225553Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6225947Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6226307Z ) 2025-05-07T20:32:28.6226528Z else: 2025-05-07T20:32:28.6226777Z scale_ub_tensor = None 2025-05-07T20:32:28.6227069Z 2025-05-07T20:32:28.6227337Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6227702Z op = silu_mul_quant 2025-05-07T20:32:28.6227995Z if compiled: 
2025-05-07T20:32:28.6228278Z op = torch.compile(op) 2025-05-07T20:32:28.6228624Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6228945Z 2025-05-07T20:32:28.6229166Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6229362Z 2025-05-07T20:32:28.6229483Z moe/activation_test.py:117: 2025-05-07T20:32:28.6229824Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6230209Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6230532Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6231468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6232261Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6232874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6233720Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6234481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6235092Z kernel = self.compile( 2025-05-07T20:32:28.6235718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6236472Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6236928Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6237199Z 2025-05-07T20:32:28.6237443Z self = 2025-05-07T20:32:28.6238669Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6240227Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f7694293640>} 2025-05-07T20:32:28.6241751Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6242919Z context = 2025-05-07T20:32:28.6243249Z 2025-05-07T20:32:28.6243447Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6244047Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6244585Z module_map=module_map) 2025-05-07T20:32:28.6245006Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6245408Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6245709Z E ^ 2025-05-07T20:32:28.6246242Z E ValueError("type fp8e4nv not supported in this architecture. 
Hypothesis then tries further examples; each one re-prints the identical test body and fails with the same CompilationError, so only the drawn parameters and the failing call site differ:

Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    -> fn() at moe/activation_test.py:117: CompilationError compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> ref_fn() at moe/activation_test.py:126: CompilationError compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> ref_fn() at moe/activation_test.py:126: CompilationError compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> ref_fn() at moe/activation_test.py:126: CompilationError compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> ref_fn() at moe/activation_test.py:126: CompilationError compiling _kernel_quantize_fp8_row
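Each "Trying example" block is Hypothesis running in verbose mode: @given draws one value per parameter from its st.sampled_from strategy, and @settings(verbosity=Verbosity.verbose) prints the drawn example before executing the test body. A self-contained toy sketch of the same mechanism (not from this suite):

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048]),
        contiguous=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=6, deadline=None)
    def test_demo(T: int, contiguous: bool) -> None:
        # Verbose mode logs each invocation as "Trying example:
        # test_demo(...)", exactly like the lines above.
        assert T > 0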
Further examples fail identically:

Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> ref_fn() at moe/activation_test.py:126: CompilationError compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
    -> fn() at moe/activation_test.py:117 (via torch/_dynamo/eval_frame.py:678 in _fn): CompilationError compiling _fbgemm_silu_mul_quant
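The failure is independent of FBGEMM's kernels: any Triton kernel that casts to tl.float8e4nv fails the same way during ast_to_ttir on this GPU. A minimal repro sketch (kernel and buffer names are illustrative):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On GPUs below SM 8.9 this cast is what raises
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)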
And again:

Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
    -> ref_fn() at moe/activation_test.py:126: CompilationError compiling _kernel_quantize_fp8_row
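For reference, the property the test checks after a successful run is y ≈ y_fp8.to(torch.float32) * y_scale[:, None], i.e. row-wise max-abs scaling into the FP8 E4M3 range, optionally clamped by scale_ub. A pure-PyTorch sketch of those semantics (an assumption-laden stand-in, not FBGEMM's triton_quantize_fp8_row):

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value in torch.float8_e4m3fn

    def rowwise_quantize_fp8(y, scale_ub=None):
        # Scale each row by its max-abs value so it fits the fp8 range;
        # dequantizing with y_fp8.to(torch.float32) * y_scale[:, None]
        # approximately recovers y, matching test_silu_mul_quant's check.
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = (row_max / FP8_E4M3_MAX).clamp(min=1e-12)
        y_fp8 = (y.to(torch.float32) / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale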
y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x…>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function …>, 'min_dot_size': <function … at 0x7f7677ea89d0>}
module_map = {'triton.language.extra.libdevice': <module …>}
context = <…>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<…>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
self = <…>
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x…>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function …>, 'min_dot_size': <function … at 0x7f767cc3d1b0>}
module_map = {'triton.language.extra.libdevice': <module …>}
context = <…>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
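[Editor's note, not part of the captured log: the repeated ValueError is an architecture mismatch rather than a bug in the test body. Triton's fp8e4nv type corresponds to torch.float8_e4m3fn, and to the best of my knowledge its CUDA lowering requires compute capability 8.9+ (Ada/Hopper), while the A10G GPU on a linux.g5 runner reports 8.6. Below is a minimal sketch of a capability guard that would skip such cases on unsupported GPUs; the helper name and its placement are illustrative, not FBGEMM's actual code.]

import unittest

import torch


def gpu_supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) lowering needs SM 8.9 (Ada) or 9.0 (Hopper);
    # the A10G on this runner is SM 8.6, hence the CompilationError above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on the test above:
#
# @unittest.skipIf(not gpu_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
# def test_silu_mul_quant(self, ...) -> None:
#     ...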
[The next six Hypothesis examples repeat the identical test source and traceback, each failing at moe/activation_test.py:117 in fn() with the same CompilationError while compiling _fbgemm_silu_mul_quant; the compiled=True runs only add a torch/_dynamo/eval_frame.py:678 frame. Duplicated listings omitted:]

Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=None,   contiguous=False, compiled=False) -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError (fp8e4nv not supported)
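[Editor's note, not part of the captured log: for readers of this log, what the op under test computes can be reconstructed from the test's own ref_fn: y = SiLU(x0) * x1, followed by row-wise FP8 quantization. A minimal eager-mode sketch follows; the quantization details (448.0 as the float8_e4m3fn max, capping the row max by scale_ub) are assumptions about triton_quantize_fp8_row's behavior, not its actual implementation.]

from typing import Optional, Tuple

import torch


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
    fp8_max: float = 448.0,  # finite max of torch.float8_e4m3fn (assumed)
) -> Tuple[torch.Tensor, torch.Tensor]:
    x0_fp32 = x0.to(torch.float32)
    y = x0_fp32 * torch.sigmoid(x0_fp32) * x1.to(torch.float32)  # SiLU(x0) * x1
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap the per-row scale
    scale = (row_max / fp8_max).clamp(min=1e-12)  # one scale per row
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale


# Dequantization, as the test itself does it:
#   y = y_fp8.to(torch.float32) * y_scale[:, None]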
Trying example: test_silu_mul_quant(
    self=<…>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <…>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    (test source identical to the listing above, continuing past fn():)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

[In this example fn() completed and the failure is in the reference path instead:]

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
[jit/autotuner/do_bench frames identical to the first traceback above]
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
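[Editor's note, not part of the captured log: the failure can be isolated from FBGEMM entirely. The standalone sketch below should raise the same "type fp8e4nv not supported in this architecture" CompilationError on an SM 8.6 GPU, since the error fires as soon as Triton lowers a cast to fp8e4nv; it assumes only that triton and a CUDA device are available.]

import torch
import triton
import triton.language as tl


@triton.jit
def _cast_to_fp8e4nv(X, Y, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(X + offs, mask=mask)
    # The cast below is what trips the ValueError on pre-SM-8.9 GPUs.
    tl.store(Y + offs, x.to(tl.float8e4nv), mask=mask)


def main(n: int = 1024) -> None:
    x = torch.randn(n, device="cuda", dtype=torch.float32)
    y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(triton.cdiv(n, 256),)](x, y, n, BLOCK=256)


if __name__ == "__main__":
    main()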
2025-05-07T20:32:28.6562425Z op = torch.compile(op) 2025-05-07T20:32:28.6562547Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6562629Z 2025-05-07T20:32:28.6562734Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6562739Z 2025-05-07T20:32:28.6562849Z moe/activation_test.py:117: 2025-05-07T20:32:28.6563000Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6563118Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6563232Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6563646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.6563763Z return fn(*args, **kwargs) 2025-05-07T20:32:28.6564323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6564441Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6564846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6565100Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6565488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6565598Z kernel = self.compile( 2025-05-07T20:32:28.6566033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6566232Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6566380Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6566385Z 2025-05-07T20:32:28.6566618Z self = 2025-05-07T20:32:28.6567483Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6568048Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76778d9c60>} 2025-05-07T20:32:28.6568969Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6569192Z context = 2025-05-07T20:32:28.6569197Z 2025-05-07T20:32:28.6569389Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6569686Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6569815Z module_map=module_map) 2025-05-07T20:32:28.6569999Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6570112Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6570201Z E ^ 2025-05-07T20:32:28.6570598Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6570603Z 2025-05-07T20:32:28.6571075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6571079Z 2025-05-07T20:32:28.6571199Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6571535Z self=, 2025-05-07T20:32:28.6571627Z T=1, 2025-05-07T20:32:28.6571715Z D=5120, 2025-05-07T20:32:28.6571809Z scale_ub=1200.0, 2025-05-07T20:32:28.6571912Z contiguous=False, 2025-05-07T20:32:28.6572006Z compiled=False, 2025-05-07T20:32:28.6572088Z ) 2025-05-07T20:32:28.6572334Z self = 2025-05-07T20:32:28.6572524Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:28.6572529Z 2025-05-07T20:32:28.6572619Z @given( 2025-05-07T20:32:28.6572754Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6572873Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6573006Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6573140Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6573269Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6573363Z ) 2025-05-07T20:32:28.6573640Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6573745Z def test_silu_mul_quant( 2025-05-07T20:32:28.6573834Z self, 2025-05-07T20:32:28.6573923Z T: int, 2025-05-07T20:32:28.6574015Z D: int, 2025-05-07T20:32:28.6574125Z scale_ub: Optional[float], 2025-05-07T20:32:28.6574226Z contiguous: bool, 2025-05-07T20:32:28.6574323Z compiled: bool, 2025-05-07T20:32:28.6574412Z ) -> None: 2025-05-07T20:32:28.6574519Z torch.manual_seed(2025) 2025-05-07T20:32:28.6574605Z 2025-05-07T20:32:28.6574794Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6574884Z 2025-05-07T20:32:28.6574993Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6575134Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6575234Z x = x_sign * x_clamp 2025-05-07T20:32:28.6575337Z x0 = x[:, :D] 2025-05-07T20:32:28.6575429Z x1 = x[:, D:] 2025-05-07T20:32:28.6575511Z 2025-05-07T20:32:28.6575609Z if contiguous: 2025-05-07T20:32:28.6575713Z x0 = x0.contiguous() 2025-05-07T20:32:28.6575816Z x1 = x1.contiguous() 2025-05-07T20:32:28.6575899Z 2025-05-07T20:32:28.6576001Z if scale_ub is not None: 2025-05-07T20:32:28.6576127Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6576281Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6576366Z ) 2025-05-07T20:32:28.6576454Z else: 2025-05-07T20:32:28.6576561Z scale_ub_tensor = None 2025-05-07T20:32:28.6576642Z 2025-05-07T20:32:28.6576891Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6576996Z op = silu_mul_quant 2025-05-07T20:32:28.6577094Z if compiled: 2025-05-07T20:32:28.6577211Z op = torch.compile(op) 2025-05-07T20:32:28.6577342Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6577427Z 2025-05-07T20:32:28.6577530Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6577535Z 2025-05-07T20:32:28.6577644Z moe/activation_test.py:117: 2025-05-07T20:32:28.6577793Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6577906Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6578020Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6578635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6578745Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6579158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6579410Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6579792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6579985Z kernel = self.compile( 2025-05-07T20:32:28.6580416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6580618Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6580764Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6580769Z 2025-05-07T20:32:28.6580996Z self = 2025-05-07T20:32:28.6581873Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6582441Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76778d9750>} 2025-05-07T20:32:28.6583281Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6583498Z context = 2025-05-07T20:32:28.6583503Z 2025-05-07T20:32:28.6583689Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6583986Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6584107Z module_map=module_map) 2025-05-07T20:32:28.6584293Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6584407Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6584493Z E ^ 2025-05-07T20:32:28.6584892Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6584903Z 2025-05-07T20:32:28.6585364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6585369Z 2025-05-07T20:32:28.6585487Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6585738Z self=, 2025-05-07T20:32:28.6585826Z T=16384, 2025-05-07T20:32:28.6585911Z D=5120, 2025-05-07T20:32:28.6586008Z scale_ub=1200.0, 2025-05-07T20:32:28.6586105Z contiguous=False, 2025-05-07T20:32:28.6586203Z compiled=True, 2025-05-07T20:32:28.6586285Z ) 2025-05-07T20:32:28.6586641Z self = 2025-05-07T20:32:28.6586853Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:28.6586858Z 2025-05-07T20:32:28.6586944Z @given( 2025-05-07T20:32:28.6587086Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6587207Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6587338Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6587472Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6587612Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6587699Z ) 2025-05-07T20:32:28.6587986Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6588095Z def test_silu_mul_quant( 2025-05-07T20:32:28.6588183Z self, 2025-05-07T20:32:28.6588276Z T: int, 2025-05-07T20:32:28.6588364Z D: int, 2025-05-07T20:32:28.6588477Z scale_ub: Optional[float], 2025-05-07T20:32:28.6588591Z contiguous: bool, 2025-05-07T20:32:28.6588690Z compiled: bool, 2025-05-07T20:32:28.6588781Z ) -> None: 2025-05-07T20:32:28.6588896Z torch.manual_seed(2025) 2025-05-07T20:32:28.6588981Z 2025-05-07T20:32:28.6589263Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6589356Z 2025-05-07T20:32:28.6589462Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6589611Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6589713Z x = x_sign * x_clamp 2025-05-07T20:32:28.6589805Z x0 = x[:, :D] 2025-05-07T20:32:28.6589904Z x1 = x[:, D:] 2025-05-07T20:32:28.6589989Z 2025-05-07T20:32:28.6590084Z if contiguous: 2025-05-07T20:32:28.6590195Z x0 = x0.contiguous() 2025-05-07T20:32:28.6590298Z x1 = x1.contiguous() 2025-05-07T20:32:28.6590381Z 2025-05-07T20:32:28.6590490Z if scale_ub is not None: 2025-05-07T20:32:28.6590618Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6590772Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6590864Z ) 2025-05-07T20:32:28.6590952Z else: 2025-05-07T20:32:28.6591073Z scale_ub_tensor = None 2025-05-07T20:32:28.6591158Z 2025-05-07T20:32:28.6591306Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6591417Z op = silu_mul_quant 2025-05-07T20:32:28.6591515Z if compiled: 2025-05-07T20:32:28.6591628Z op = torch.compile(op) 2025-05-07T20:32:28.6591757Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6591842Z 2025-05-07T20:32:28.6591946Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6591951Z 2025-05-07T20:32:28.6592067Z moe/activation_test.py:117: 2025-05-07T20:32:28.6592216Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6592339Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6592459Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6592875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.6592987Z return fn(*args, **kwargs) 
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self = 
T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76778d96c0>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
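The architecture check is independent of FBGEMM; the same ValueError can be reproduced with a few lines of standalone Triton. A sketch under the assumption of a recent Triton and PyTorch build (where tl.float8e4nv and torch.float8_e4m3fn both exist); on a pre-SM-8.9 card the launch raises the identical CompilationError:

import torch
import triton
import triton.language as tl


@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
    # Load a block of bfloat16 values and store them as fp8e4nv;
    # on SM < 8.9 Triton rejects the fp8e4nv type during ast_to_ttir.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
# Raises triton.compiler.errors.CompilationError on an A10G (SM 8.6).
_cast_to_fp8e4nv[(1,)](x, y, 1024, BLOCK=1024)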
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body and traceback as above]
E       triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
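The error message itself names the escape hatch: on this architecture Triton still supports fp8e5, i.e. the e5m2 format, which trades a mantissa bit for range. If pre-Ada coverage matters, one hedged option is to pick the fp8 dtype by compute capability; pick_fp8_dtype below is a hypothetical helper, and nothing here claims that silu_mul_quant accepts an e5m2 output today:

import torch


def pick_fp8_dtype() -> torch.dtype:
    # Prefer e4m3 (Triton fp8e4nv) where the GPU can compile it;
    # fall back to e5m2 (Triton fp8e5), which the error message
    # lists as supported on this architecture.
    if torch.cuda.get_device_capability() >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2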
[the same test body and an essentially identical traceback (modulo the torch.compile frame when compiled=True) repeat for each of the following Hypothesis examples, every one ending in the same CompilationError; duplicates elided]
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
[same test body and traceback as above]
E       triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6769263Z 2025-05-07T20:32:28.6769731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6769736Z 2025-05-07T20:32:28.6769857Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6770114Z self=, 2025-05-07T20:32:28.6770205Z T=2048, 2025-05-07T20:32:28.6770295Z D=7168, 2025-05-07T20:32:28.6770404Z scale_ub=None, 2025-05-07T20:32:28.6770504Z contiguous=False, 2025-05-07T20:32:28.6770601Z compiled=True, 2025-05-07T20:32:28.6770692Z ) 2025-05-07T20:32:28.6770939Z self = 2025-05-07T20:32:28.6771232Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:28.6771237Z 2025-05-07T20:32:28.6771325Z @given( 2025-05-07T20:32:28.6771463Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6771586Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6771717Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6771852Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6771989Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6772076Z ) 2025-05-07T20:32:28.6772362Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6772476Z def test_silu_mul_quant( 2025-05-07T20:32:28.6772566Z self, 2025-05-07T20:32:28.6772663Z T: int, 2025-05-07T20:32:28.6772752Z D: int, 2025-05-07T20:32:28.6772866Z scale_ub: Optional[float], 2025-05-07T20:32:28.6772981Z contiguous: bool, 2025-05-07T20:32:28.6773081Z compiled: bool, 2025-05-07T20:32:28.6773172Z ) -> None: 2025-05-07T20:32:28.6773289Z torch.manual_seed(2025) 2025-05-07T20:32:28.6773373Z 2025-05-07T20:32:28.6773566Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6773659Z 2025-05-07T20:32:28.6773766Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6773911Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6774021Z x = x_sign * x_clamp 2025-05-07T20:32:28.6774115Z x0 = x[:, :D] 2025-05-07T20:32:28.6774214Z x1 = x[:, D:] 2025-05-07T20:32:28.6774299Z 2025-05-07T20:32:28.6774395Z if contiguous: 2025-05-07T20:32:28.6774512Z x0 = x0.contiguous() 2025-05-07T20:32:28.6774615Z x1 = x1.contiguous() 2025-05-07T20:32:28.6774700Z 2025-05-07T20:32:28.6774808Z if scale_ub is not None: 2025-05-07T20:32:28.6774931Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6775094Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6775192Z ) 2025-05-07T20:32:28.6775281Z else: 2025-05-07T20:32:28.6775390Z scale_ub_tensor = None 2025-05-07T20:32:28.6775479Z 2025-05-07T20:32:28.6775632Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6775741Z op = silu_mul_quant 2025-05-07T20:32:28.6775839Z if compiled: 2025-05-07T20:32:28.6775954Z op = torch.compile(op) 2025-05-07T20:32:28.6776081Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6776165Z 2025-05-07T20:32:28.6776269Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6776273Z 2025-05-07T20:32:28.6776486Z moe/activation_test.py:117: 2025-05-07T20:32:28.6776633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6776749Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6776877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6777296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.6777410Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.6777970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6778082Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6778495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6778749Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6779143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6779262Z kernel = self.compile( 2025-05-07T20:32:28.6779697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6779989Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6780135Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6780140Z 2025-05-07T20:32:28.6780371Z self = 2025-05-07T20:32:28.6781252Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6781819Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f767698af80>} 2025-05-07T20:32:28.6782666Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6782889Z context = 2025-05-07T20:32:28.6782894Z 2025-05-07T20:32:28.6783088Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6783388Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6783513Z module_map=module_map) 2025-05-07T20:32:28.6783703Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6783818Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6783907Z E ^ 2025-05-07T20:32:28.6784318Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6784323Z 2025-05-07T20:32:28.6784789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6784800Z 2025-05-07T20:32:28.6784927Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6785183Z self=, 2025-05-07T20:32:28.6785274Z T=4096, 2025-05-07T20:32:28.6785368Z D=7168, 2025-05-07T20:32:28.6785465Z scale_ub=None, 2025-05-07T20:32:28.6785564Z contiguous=False, 2025-05-07T20:32:28.6785669Z compiled=True, 2025-05-07T20:32:28.6785753Z ) 2025-05-07T20:32:28.6786000Z self = 2025-05-07T20:32:28.6786203Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:28.6786208Z 2025-05-07T20:32:28.6786298Z @given( 2025-05-07T20:32:28.6786528Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6786645Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6786777Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6786915Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6787052Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6787138Z ) 2025-05-07T20:32:28.6787424Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6787533Z def test_silu_mul_quant( 2025-05-07T20:32:28.6787629Z self, 2025-05-07T20:32:28.6787718Z T: int, 2025-05-07T20:32:28.6787806Z D: int, 2025-05-07T20:32:28.6787926Z scale_ub: Optional[float], 2025-05-07T20:32:28.6788030Z contiguous: bool, 2025-05-07T20:32:28.6788128Z compiled: bool, 2025-05-07T20:32:28.6788224Z ) -> None: 2025-05-07T20:32:28.6788333Z torch.manual_seed(2025) 2025-05-07T20:32:28.6788420Z 2025-05-07T20:32:28.6788621Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6788705Z 2025-05-07T20:32:28.6788813Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6788964Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6789155Z x = x_sign * x_clamp 2025-05-07T20:32:28.6789250Z x0 = x[:, :D] 2025-05-07T20:32:28.6789346Z x1 = x[:, D:] 2025-05-07T20:32:28.6789429Z 2025-05-07T20:32:28.6789532Z if contiguous: 2025-05-07T20:32:28.6789639Z x0 = x0.contiguous() 2025-05-07T20:32:28.6789744Z x1 = x1.contiguous() 2025-05-07T20:32:28.6789833Z 2025-05-07T20:32:28.6789937Z if scale_ub is not None: 2025-05-07T20:32:28.6790061Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6790224Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6790311Z ) 2025-05-07T20:32:28.6790403Z else: 2025-05-07T20:32:28.6790525Z scale_ub_tensor = None 2025-05-07T20:32:28.6790608Z 2025-05-07T20:32:28.6790756Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6790866Z op = silu_mul_quant 2025-05-07T20:32:28.6790965Z if compiled: 2025-05-07T20:32:28.6791092Z op = torch.compile(op) 2025-05-07T20:32:28.6791215Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6791301Z 2025-05-07T20:32:28.6791410Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6791415Z 2025-05-07T20:32:28.6791525Z moe/activation_test.py:117: 2025-05-07T20:32:28.6791674Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6791795Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6791909Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6792325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.6792436Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.6793001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6793119Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6793575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6793830Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6794224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6794333Z kernel = self.compile( 2025-05-07T20:32:28.6794769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6800327Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6800483Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6800606Z 2025-05-07T20:32:28.6800845Z self = 2025-05-07T20:32:28.6801717Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6802287Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f767698be20>} 2025-05-07T20:32:28.6803131Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6803347Z context = 2025-05-07T20:32:28.6803352Z 2025-05-07T20:32:28.6803551Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6803849Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6803971Z module_map=module_map) 2025-05-07T20:32:28.6804246Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6804361Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6804452Z E ^ 2025-05-07T20:32:28.6804857Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6804862Z 2025-05-07T20:32:28.6805330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6805334Z 2025-05-07T20:32:28.6805455Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6805706Z self=, 2025-05-07T20:32:28.6805806Z T=16384, 2025-05-07T20:32:28.6805895Z D=5120, 2025-05-07T20:32:28.6805988Z scale_ub=1200.0, 2025-05-07T20:32:28.6806091Z contiguous=False, 2025-05-07T20:32:28.6806187Z compiled=False, 2025-05-07T20:32:28.6806280Z ) 2025-05-07T20:32:28.6806529Z self = 2025-05-07T20:32:28.6806731Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:28.6806736Z 2025-05-07T20:32:28.6806828Z @given( 2025-05-07T20:32:28.6806968Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6807080Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6807213Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6807344Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6807471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6807560Z ) 2025-05-07T20:32:28.6807844Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6807950Z def test_silu_mul_quant( 2025-05-07T20:32:28.6808040Z self, 2025-05-07T20:32:28.6808129Z T: int, 2025-05-07T20:32:28.6808215Z D: int, 2025-05-07T20:32:28.6808337Z scale_ub: Optional[float], 2025-05-07T20:32:28.6808437Z contiguous: bool, 2025-05-07T20:32:28.6808533Z compiled: bool, 2025-05-07T20:32:28.6808625Z ) -> None: 2025-05-07T20:32:28.6808733Z torch.manual_seed(2025) 2025-05-07T20:32:28.6808819Z 2025-05-07T20:32:28.6809008Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6809093Z 2025-05-07T20:32:28.6809203Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6809344Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6809445Z x = x_sign * x_clamp 2025-05-07T20:32:28.6809541Z x0 = x[:, :D] 2025-05-07T20:32:28.6809632Z x1 = x[:, D:] 2025-05-07T20:32:28.6809713Z 2025-05-07T20:32:28.6809900Z if contiguous: 2025-05-07T20:32:28.6810006Z x0 = x0.contiguous() 2025-05-07T20:32:28.6810107Z x1 = x1.contiguous() 2025-05-07T20:32:28.6810192Z 2025-05-07T20:32:28.6810294Z if scale_ub is not None: 2025-05-07T20:32:28.6810420Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6810578Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6810663Z ) 2025-05-07T20:32:28.6810755Z else: 2025-05-07T20:32:28.6810860Z scale_ub_tensor = None 2025-05-07T20:32:28.6810942Z 2025-05-07T20:32:28.6811091Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6811194Z op = silu_mul_quant 2025-05-07T20:32:28.6811291Z if compiled: 2025-05-07T20:32:28.6811406Z op = torch.compile(op) 2025-05-07T20:32:28.6811528Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6811612Z 2025-05-07T20:32:28.6811719Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6811723Z 2025-05-07T20:32:28.6811833Z moe/activation_test.py:117: 2025-05-07T20:32:28.6811981Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6812184Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6812297Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6812867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:28.6812977Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6813388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6813642Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6814025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6814137Z kernel = self.compile( 2025-05-07T20:32:28.6814575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6814775Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6814929Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6814933Z 2025-05-07T20:32:28.6815163Z self = 2025-05-07T20:32:28.6816041Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6816607Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76768717e0>} 2025-05-07T20:32:28.6817450Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6817665Z context = 2025-05-07T20:32:28.6817676Z 2025-05-07T20:32:28.6817862Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6818161Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6818284Z module_map=module_map) 2025-05-07T20:32:28.6818472Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6818584Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6818671Z E ^ 2025-05-07T20:32:28.6819073Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6819078Z 2025-05-07T20:32:28.6819627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6819633Z 2025-05-07T20:32:28.6819751Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6820013Z self=, 2025-05-07T20:32:28.6820099Z T=16384, 2025-05-07T20:32:28.6820188Z D=5120, 2025-05-07T20:32:28.6820283Z scale_ub=1200.0, 2025-05-07T20:32:28.6820379Z contiguous=True, 2025-05-07T20:32:28.6820477Z compiled=True, 2025-05-07T20:32:28.6820559Z ) 2025-05-07T20:32:28.6820805Z self = 2025-05-07T20:32:28.6821006Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:28.6821011Z 2025-05-07T20:32:28.6821098Z @given( 2025-05-07T20:32:28.6821231Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6821349Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6821485Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6821622Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6821750Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6821922Z ) 2025-05-07T20:32:28.6822203Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6822311Z def test_silu_mul_quant( 2025-05-07T20:32:28.6822397Z self, 2025-05-07T20:32:28.6822488Z T: int, 2025-05-07T20:32:28.6822575Z D: int, 2025-05-07T20:32:28.6822685Z scale_ub: Optional[float], 2025-05-07T20:32:28.6822789Z contiguous: bool, 2025-05-07T20:32:28.6822886Z compiled: bool, 2025-05-07T20:32:28.6822972Z ) -> None: 2025-05-07T20:32:28.6823082Z torch.manual_seed(2025) 2025-05-07T20:32:28.6823164Z 2025-05-07T20:32:28.6823358Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6823442Z 2025-05-07T20:32:28.6823553Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6823701Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6824073Z x = x_sign * x_clamp 2025-05-07T20:32:28.6824215Z x0 = x[:, :D] 2025-05-07T20:32:28.6824363Z x1 = x[:, D:] 2025-05-07T20:32:28.6824446Z 2025-05-07T20:32:28.6824541Z if contiguous: 2025-05-07T20:32:28.6824648Z x0 = x0.contiguous() 2025-05-07T20:32:28.6824750Z x1 = x1.contiguous() 2025-05-07T20:32:28.6824831Z 2025-05-07T20:32:28.6824939Z if scale_ub is not None: 2025-05-07T20:32:28.6825059Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6825217Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6825303Z ) 2025-05-07T20:32:28.6825387Z else: 2025-05-07T20:32:28.6825496Z scale_ub_tensor = None 2025-05-07T20:32:28.6825576Z 2025-05-07T20:32:28.6825729Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6825835Z op = silu_mul_quant 2025-05-07T20:32:28.6825932Z if compiled: 2025-05-07T20:32:28.6826045Z op = torch.compile(op) 2025-05-07T20:32:28.6826169Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6826257Z 2025-05-07T20:32:28.6826359Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6826369Z 2025-05-07T20:32:28.6826480Z moe/activation_test.py:117: 2025-05-07T20:32:28.6826624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6826744Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6826855Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6827269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.6827377Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.6828092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6828206Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6828667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6828925Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6829314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6829420Z kernel = self.compile( 2025-05-07T20:32:28.6829851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6830053Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6830196Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6830201Z 2025-05-07T20:32:28.6830440Z self = 2025-05-07T20:32:28.6831311Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6832027Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f7676871090>} 2025-05-07T20:32:28.6832871Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6833087Z context = 2025-05-07T20:32:28.6833092Z 2025-05-07T20:32:28.6833281Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6833650Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6833772Z module_map=module_map) 2025-05-07T20:32:28.6833960Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6834081Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6834176Z E ^ 2025-05-07T20:32:28.6834576Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6834582Z 2025-05-07T20:32:28.6835045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6835050Z 2025-05-07T20:32:28.6835171Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6835422Z self=, 2025-05-07T20:32:28.6835513Z T=16384, 2025-05-07T20:32:28.6835601Z D=5120, 2025-05-07T20:32:28.6835696Z scale_ub=None, 2025-05-07T20:32:28.6835804Z contiguous=False, 2025-05-07T20:32:28.6835898Z compiled=True, 2025-05-07T20:32:28.6835980Z ) 2025-05-07T20:32:28.6836228Z self = 2025-05-07T20:32:28.6836433Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:28.6836438Z 2025-05-07T20:32:28.6836526Z @given( 2025-05-07T20:32:28.6836663Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6836776Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6836906Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6837042Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6837172Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6837260Z ) 2025-05-07T20:32:28.6837540Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6837645Z def test_silu_mul_quant( 2025-05-07T20:32:28.6837734Z self, 2025-05-07T20:32:28.6837914Z T: int, 2025-05-07T20:32:28.6838003Z D: int, 2025-05-07T20:32:28.6838118Z scale_ub: Optional[float], 2025-05-07T20:32:28.6838220Z contiguous: bool, 2025-05-07T20:32:28.6838319Z compiled: bool, 2025-05-07T20:32:28.6838410Z ) -> None: 2025-05-07T20:32:28.6838519Z torch.manual_seed(2025) 2025-05-07T20:32:28.6838601Z 2025-05-07T20:32:28.6838796Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6838879Z 2025-05-07T20:32:28.6838989Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6839129Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6839231Z x = x_sign * x_clamp 2025-05-07T20:32:28.6839325Z x0 = x[:, :D] 2025-05-07T20:32:28.6839416Z x1 = x[:, D:] 2025-05-07T20:32:28.6839497Z 2025-05-07T20:32:28.6839596Z if contiguous: 2025-05-07T20:32:28.6839700Z x0 = x0.contiguous() 2025-05-07T20:32:28.6839808Z x1 = x1.contiguous() 2025-05-07T20:32:28.6839895Z 2025-05-07T20:32:28.6839997Z if scale_ub is not None: 2025-05-07T20:32:28.6840116Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6840360Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6840445Z ) 2025-05-07T20:32:28.6840533Z else: 2025-05-07T20:32:28.6840639Z scale_ub_tensor = None 2025-05-07T20:32:28.6840720Z 2025-05-07T20:32:28.6840869Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6840972Z op = silu_mul_quant 2025-05-07T20:32:28.6841067Z if compiled: 2025-05-07T20:32:28.6841182Z op = torch.compile(op) 2025-05-07T20:32:28.6841303Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6841383Z 2025-05-07T20:32:28.6841490Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6841495Z 2025-05-07T20:32:28.6841604Z moe/activation_test.py:117: 2025-05-07T20:32:28.6841761Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6841875Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6841987Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6842412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.6842517Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.6843074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6843189Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6843594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6843850Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6844238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6844344Z kernel = self.compile( 2025-05-07T20:32:28.6844776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6844979Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6845122Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6845127Z 2025-05-07T20:32:28.6845359Z self = 2025-05-07T20:32:28.6846229Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6846887Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f7676872290>} 2025-05-07T20:32:28.6847727Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6847952Z context = 2025-05-07T20:32:28.6847957Z 2025-05-07T20:32:28.6848146Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6848443Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6848569Z module_map=module_map) 2025-05-07T20:32:28.6848751Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6848863Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6848951Z E ^ 2025-05-07T20:32:28.6849357Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6849363Z 2025-05-07T20:32:28.6849832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6849837Z 2025-05-07T20:32:28.6850040Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6850293Z self=, 2025-05-07T20:32:28.6850385Z T=2048, 2025-05-07T20:32:28.6850472Z D=5120, 2025-05-07T20:32:28.6850564Z scale_ub=None, 2025-05-07T20:32:28.6850667Z contiguous=False, 2025-05-07T20:32:28.6850762Z compiled=True, 2025-05-07T20:32:28.6850847Z ) 2025-05-07T20:32:28.6851092Z self = 2025-05-07T20:32:28.6851288Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:28.6851293Z 2025-05-07T20:32:28.6851384Z @given( 2025-05-07T20:32:28.6851516Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6851634Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6851767Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6851899Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6852039Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6852124Z ) 2025-05-07T20:32:28.6852403Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6852512Z def test_silu_mul_quant( 2025-05-07T20:32:28.6852598Z self, 2025-05-07T20:32:28.6852686Z T: int, 2025-05-07T20:32:28.6852773Z D: int, 2025-05-07T20:32:28.6852883Z scale_ub: Optional[float], 2025-05-07T20:32:28.6852982Z contiguous: bool, 2025-05-07T20:32:28.6853082Z compiled: bool, 2025-05-07T20:32:28.6853173Z ) -> None: 2025-05-07T20:32:28.6853279Z torch.manual_seed(2025) 2025-05-07T20:32:28.6853364Z 2025-05-07T20:32:28.6853558Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6853642Z 2025-05-07T20:32:28.6853751Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6853893Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6854002Z x = x_sign * x_clamp 2025-05-07T20:32:28.6854093Z x0 = x[:, :D] 2025-05-07T20:32:28.6854184Z x1 = x[:, D:] 2025-05-07T20:32:28.6854268Z 2025-05-07T20:32:28.6854363Z if contiguous: 2025-05-07T20:32:28.6854466Z x0 = x0.contiguous() 2025-05-07T20:32:28.6854568Z x1 = x1.contiguous() 2025-05-07T20:32:28.6854649Z 2025-05-07T20:32:28.6854750Z if scale_ub is not None: 2025-05-07T20:32:28.6854876Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6855029Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6855114Z ) 2025-05-07T20:32:28.6855204Z else: 2025-05-07T20:32:28.6855308Z scale_ub_tensor = None 2025-05-07T20:32:28.6855392Z 2025-05-07T20:32:28.6855627Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6855730Z op = silu_mul_quant 2025-05-07T20:32:28.6855829Z if compiled: 2025-05-07T20:32:28.6855940Z op = torch.compile(op) 2025-05-07T20:32:28.6856064Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6856151Z 2025-05-07T20:32:28.6856255Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6856260Z 2025-05-07T20:32:28.6856369Z moe/activation_test.py:117: 2025-05-07T20:32:28.6856517Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6856631Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6856746Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6857164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.6857269Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.6857842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6857953Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6858363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6858740Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6859138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6859249Z kernel = self.compile( 2025-05-07T20:32:28.6859679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6859876Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6860022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6860027Z 2025-05-07T20:32:28.6860261Z self = 2025-05-07T20:32:28.6861139Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6861706Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f7676872170>} 2025-05-07T20:32:28.6862544Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6862761Z context = 2025-05-07T20:32:28.6862766Z 2025-05-07T20:32:28.6862953Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6863258Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6863379Z module_map=module_map) 2025-05-07T20:32:28.6863566Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6863682Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6863767Z E ^ 2025-05-07T20:32:28.6864166Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6864175Z 2025-05-07T20:32:28.6864638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6864643Z 2025-05-07T20:32:28.6864761Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6865014Z self=, 2025-05-07T20:32:28.6865102Z T=2048, 2025-05-07T20:32:28.6865188Z D=5120, 2025-05-07T20:32:28.6865403Z scale_ub=1200.0, 2025-05-07T20:32:28.6865504Z contiguous=False, 2025-05-07T20:32:28.6865597Z compiled=True, 2025-05-07T20:32:28.6865683Z ) 2025-05-07T20:32:28.6865935Z self = 2025-05-07T20:32:28.6866140Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:28.6866145Z 2025-05-07T20:32:28.6866230Z @given( 2025-05-07T20:32:28.6866363Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6866478Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6866606Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6866737Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6866869Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6866952Z ) 2025-05-07T20:32:28.6867228Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6867343Z def test_silu_mul_quant( 2025-05-07T20:32:28.6867428Z self, 2025-05-07T20:32:28.6867517Z T: int, 2025-05-07T20:32:28.6867602Z D: int, 2025-05-07T20:32:28.6867711Z scale_ub: Optional[float], 2025-05-07T20:32:28.6867901Z contiguous: bool, 2025-05-07T20:32:28.6867997Z compiled: bool, 2025-05-07T20:32:28.6868084Z ) -> None: 2025-05-07T20:32:28.6868191Z torch.manual_seed(2025) 2025-05-07T20:32:28.6868272Z 2025-05-07T20:32:28.6868471Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6868571Z 2025-05-07T20:32:28.6868678Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6868846Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6868974Z x = x_sign * x_clamp 2025-05-07T20:32:28.6869070Z x0 = x[:, :D] 2025-05-07T20:32:28.6869171Z x1 = x[:, D:] 2025-05-07T20:32:28.6869256Z 2025-05-07T20:32:28.6869352Z if contiguous: 2025-05-07T20:32:28.6869469Z x0 = x0.contiguous() 2025-05-07T20:32:28.6869570Z x1 = x1.contiguous() 2025-05-07T20:32:28.6869652Z 2025-05-07T20:32:28.6869760Z if scale_ub is not None: 2025-05-07T20:32:28.6869881Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6870047Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6870139Z ) 2025-05-07T20:32:28.6870227Z else: 2025-05-07T20:32:28.6870334Z scale_ub_tensor = None 2025-05-07T20:32:28.6870424Z 2025-05-07T20:32:28.6870576Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6870679Z op = silu_mul_quant 2025-05-07T20:32:28.6870781Z if compiled: 2025-05-07T20:32:28.6870896Z op = torch.compile(op) 2025-05-07T20:32:28.6871023Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6871107Z 2025-05-07T20:32:28.6871212Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6871217Z 2025-05-07T20:32:28.6871335Z moe/activation_test.py:117: 2025-05-07T20:32:28.6871481Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6871597Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6871726Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6872147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.6872258Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.6872823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6872934Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6873348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6873657Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6874139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6874255Z kernel = self.compile( 2025-05-07T20:32:28.6874690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6874904Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6875049Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6875054Z 2025-05-07T20:32:28.6875286Z self = 2025-05-07T20:32:28.6876173Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6876747Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f7676873880>} 2025-05-07T20:32:28.6877601Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6877909Z context = 2025-05-07T20:32:28.6877914Z 2025-05-07T20:32:28.6878112Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6878418Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6878539Z module_map=module_map) 2025-05-07T20:32:28.6878754Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6878871Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6878959Z E ^ 2025-05-07T20:32:28.6879368Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6879373Z 2025-05-07T20:32:28.6879840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6879852Z 2025-05-07T20:32:28.6879975Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6880227Z self=, 2025-05-07T20:32:28.6880315Z T=4096, 2025-05-07T20:32:28.6880412Z D=5120, 2025-05-07T20:32:28.6880509Z scale_ub=1200.0, 2025-05-07T20:32:28.6880606Z contiguous=True, 2025-05-07T20:32:28.6880705Z compiled=True, 2025-05-07T20:32:28.6880788Z ) 2025-05-07T20:32:28.6881032Z self = 2025-05-07T20:32:28.6881233Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:28.6881238Z 2025-05-07T20:32:28.6881327Z @given( 2025-05-07T20:32:28.6881473Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6881588Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6881721Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6881866Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6881996Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6882084Z ) 2025-05-07T20:32:28.6882370Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6882478Z def test_silu_mul_quant( 2025-05-07T20:32:28.6882565Z self, 2025-05-07T20:32:28.6882658Z T: int, 2025-05-07T20:32:28.6882745Z D: int, 2025-05-07T20:32:28.6882862Z scale_ub: Optional[float], 2025-05-07T20:32:28.6882964Z contiguous: bool, 2025-05-07T20:32:28.6883063Z compiled: bool, 2025-05-07T20:32:28.6883157Z ) -> None: 2025-05-07T20:32:28.6883266Z torch.manual_seed(2025) 2025-05-07T20:32:28.6883349Z 2025-05-07T20:32:28.6883640Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6883726Z 2025-05-07T20:32:28.6883833Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6883982Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6884088Z x = x_sign * x_clamp 2025-05-07T20:32:28.6884179Z x0 = x[:, :D] 2025-05-07T20:32:28.6884277Z x1 = x[:, D:] 2025-05-07T20:32:28.6884358Z 2025-05-07T20:32:28.6884453Z if contiguous: 2025-05-07T20:32:28.6884563Z x0 = x0.contiguous() 2025-05-07T20:32:28.6884664Z x1 = x1.contiguous() 2025-05-07T20:32:28.6884750Z 2025-05-07T20:32:28.6884853Z if scale_ub is not None: 2025-05-07T20:32:28.6884975Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6885133Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6885222Z ) 2025-05-07T20:32:28.6885310Z else: 2025-05-07T20:32:28.6885431Z scale_ub_tensor = None 2025-05-07T20:32:28.6885514Z 2025-05-07T20:32:28.6885660Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6885768Z op = silu_mul_quant 2025-05-07T20:32:28.6885952Z if compiled: 2025-05-07T20:32:28.6886065Z op = torch.compile(op) 2025-05-07T20:32:28.6886190Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6886272Z 2025-05-07T20:32:28.6886378Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6886383Z 2025-05-07T20:32:28.6886495Z moe/activation_test.py:117: 2025-05-07T20:32:28.6886643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6886764Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6886878Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6887296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.6887411Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.6887969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.6888088Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.6888538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6888804Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6889194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6889302Z kernel = self.compile( 2025-05-07T20:32:28.6889736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6889942Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6890090Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6890095Z 2025-05-07T20:32:28.6890331Z self = 2025-05-07T20:32:28.6891205Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6891780Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f76767f8940>} 2025-05-07T20:32:28.6892622Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6892839Z context = 2025-05-07T20:32:28.6892844Z 2025-05-07T20:32:28.6893128Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6893430Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6893564Z module_map=module_map) 2025-05-07T20:32:28.6893747Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6893859Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.6893956Z E ^ 2025-05-07T20:32:28.6894358Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6894364Z 2025-05-07T20:32:28.6894829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6894841Z 2025-05-07T20:32:28.6894959Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6895212Z self=, 2025-05-07T20:32:28.6895310Z T=128, 2025-05-07T20:32:28.6895399Z D=5120, 2025-05-07T20:32:28.6895494Z scale_ub=1200.0, 2025-05-07T20:32:28.6895598Z contiguous=False, 2025-05-07T20:32:28.6895691Z compiled=True, 2025-05-07T20:32:28.6895932Z ) 2025-05-07T20:32:28.6896185Z self = 2025-05-07T20:32:28.6896378Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:28.6896383Z 2025-05-07T20:32:28.6896469Z @given( 2025-05-07T20:32:28.6896609Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6896722Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6896858Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6896992Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6897122Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6897209Z ) 2025-05-07T20:32:28.6897492Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6897599Z def test_silu_mul_quant( 2025-05-07T20:32:28.6897693Z self, 2025-05-07T20:32:28.6897778Z T: int, 2025-05-07T20:32:28.6897866Z D: int, 2025-05-07T20:32:28.6897991Z scale_ub: Optional[float], 2025-05-07T20:32:28.6898093Z contiguous: bool, 2025-05-07T20:32:28.6898193Z compiled: bool, 2025-05-07T20:32:28.6898284Z ) -> None: 2025-05-07T20:32:28.6898415Z torch.manual_seed(2025) 2025-05-07T20:32:28.6898510Z 2025-05-07T20:32:28.6898721Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6898806Z 2025-05-07T20:32:28.6898916Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6899058Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6899159Z x = x_sign * x_clamp 2025-05-07T20:32:28.6899256Z x0 = x[:, :D] 2025-05-07T20:32:28.6899346Z x1 = x[:, D:] 2025-05-07T20:32:28.6899430Z 2025-05-07T20:32:28.6899534Z if contiguous: 2025-05-07T20:32:28.6899638Z x0 = x0.contiguous() 2025-05-07T20:32:28.6899740Z x1 = x1.contiguous() 2025-05-07T20:32:28.6899831Z 2025-05-07T20:32:28.6899938Z if scale_ub is not None: 2025-05-07T20:32:28.6900064Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6900220Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6900308Z ) 2025-05-07T20:32:28.6900399Z else: 2025-05-07T20:32:28.6900505Z scale_ub_tensor = None 2025-05-07T20:32:28.6900590Z 2025-05-07T20:32:28.6900748Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6900851Z op = silu_mul_quant 2025-05-07T20:32:28.6900949Z if compiled: 2025-05-07T20:32:28.6901068Z op = torch.compile(op) 2025-05-07T20:32:28.6901189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6901273Z 2025-05-07T20:32:28.6901509Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.6901514Z 2025-05-07T20:32:28.6901626Z moe/activation_test.py:117: 2025-05-07T20:32:28.6901779Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6901900Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.6902013Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6902438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.6902547Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.6903107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:28.6903225Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:28.6903632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:28.6903897Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:28.6904283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:28.6904394Z     kernel = self.compile(
2025-05-07T20:32:28.6904920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:28.6905121Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:28.6905270Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:28.6905505Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:28.6906378Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:28.6906955Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f76767f91b0>}
2025-05-07T20:32:28.6907796Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:28.6908029Z context = <...>
2025-05-07T20:32:28.6908227Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:28.6908533Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:28.6908662Z                            module_map=module_map)
2025-05-07T20:32:28.6908849Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:28.6908975Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:28.6909066Z E   ^
2025-05-07T20:32:28.6909480Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:28.6909967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:28.6910100Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:28.6910357Z     self=<...>,
2025-05-07T20:32:28.6910449Z     T=16384,
2025-05-07T20:32:28.6910536Z     D=7168,
2025-05-07T20:32:28.6910637Z     scale_ub=1200.0,
2025-05-07T20:32:28.6910734Z     contiguous=True,
2025-05-07T20:32:28.6910829Z     compiled=True,
2025-05-07T20:32:28.6910916Z )
2025-05-07T20:32:28.6911163Z self = <...>
2025-05-07T20:32:28.6911363Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:28.6911551Z     @given(
2025-05-07T20:32:28.6911686Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:28.6911805Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:28.6911940Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:28.6912073Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:28.6912206Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:28.6912292Z     )
2025-05-07T20:32:28.6912572Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:28.6912684Z     def test_silu_mul_quant(
2025-05-07T20:32:28.6912770Z         self,
2025-05-07T20:32:28.6912858Z         T: int,
2025-05-07T20:32:28.6912950Z         D: int,
2025-05-07T20:32:28.6913061Z         scale_ub: Optional[float],
2025-05-07T20:32:28.6913162Z         contiguous: bool,
2025-05-07T20:32:28.6913262Z         compiled: bool,
2025-05-07T20:32:28.6913352Z     ) -> None:
2025-05-07T20:32:28.6913471Z         torch.manual_seed(2025)
2025-05-07T20:32:28.6913794Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.6914083Z         x_sign = torch.sign(x)
2025-05-07T20:32:28.6914226Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:28.6914332Z         x = x_sign * x_clamp
2025-05-07T20:32:28.6914425Z         x0 = x[:, :D]
2025-05-07T20:32:28.6914516Z         x1 = x[:, D:]
2025-05-07T20:32:28.6914699Z         if contiguous:
2025-05-07T20:32:28.6914804Z             x0 = x0.contiguous()
2025-05-07T20:32:28.6914915Z             x1 = x1.contiguous()
2025-05-07T20:32:28.6915100Z         if scale_ub is not None:
2025-05-07T20:32:28.6915226Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:28.6915381Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:28.6915476Z             )
2025-05-07T20:32:28.6915570Z         else:
2025-05-07T20:32:28.6915680Z             scale_ub_tensor = None
2025-05-07T20:32:28.6915915Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:28.6916029Z             op = silu_mul_quant
2025-05-07T20:32:28.6916133Z             if compiled:
2025-05-07T20:32:28.6916247Z                 op = torch.compile(op)
2025-05-07T20:32:28.6916370Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:28.6916563Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:28.6916686Z moe/activation_test.py:117:
2025-05-07T20:32:28.6916833Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:28.6916948Z moe/activation_test.py:115: in fn
2025-05-07T20:32:28.6917066Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:28.6917488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:28.6917593Z     return fn(*args, **kwargs)
2025-05-07T20:32:28.6918159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:28.6918280Z     _fbgemm_silu_mul_quant[grid](
[... Triton compilation traceback identical to the one above ...]
2025-05-07T20:32:28.6924347Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:28.6924468Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:28.6924555Z E   ^
2025-05-07T20:32:28.6924961Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:28.6925432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
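Editor's note: every CompilationError in this run has the same root cause. Triton lowers the fp8e4nv (FP8 E4M3) dtype only on NVIDIA GPUs of compute capability sm_89 or newer, and the A10G on this linux.g5.4xlarge runner is sm_86, hence the ('fp8e4b15', 'fp8e5') list in the error. A minimal guard sketch that would skip rather than error on such hardware; supports_fp8e4nv is a hypothetical helper, not an existing FBGEMM API:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv needs sm_89+ (Ada/Hopper);
        # torch.cuda.get_device_capability() reports (8, 6) on an A10G.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationGuardExample(unittest.TestCase):
        @unittest.skipUnless(supports_fp8e4nv(), "FP8 E4M3 not supported on this GPU")
        def test_silu_mul_quant(self) -> None:
            ...  # body as shown above

The same check could equally gate the kernel itself, falling back to a non-fp8 path on older architectures.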
2025-05-07T20:32:28.6925555Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[... same test body and Triton traceback; fails with CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at triton/compiler/compiler.py:100 ...]

2025-05-07T20:32:28.6945436Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[... same CompilationError ...]

2025-05-07T20:32:28.6959954Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... same CompilationError ...]

2025-05-07T20:32:28.6974757Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... same CompilationError ...]

2025-05-07T20:32:28.6989615Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
[... same CompilationError ...]
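Editor's note: to re-run one of these failing parameter sets deterministically, instead of waiting for Hypothesis to re-draw it, the case can be pinned with Hypothesis's @example decorator, which stacks with the existing @given. A sketch using the first failing example above (the test body itself is unchanged):

    from hypothesis import Verbosity, example, given, settings
    from hypothesis import strategies as st

    @example(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
        ...  # unchanged body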
2025-05-07T20:32:28.7004549Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
[... same test body up to the failing line ...]
2025-05-07T20:32:28.7008532Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:28.7010580Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:28.7010728Z moe/activation_test.py:95: OutOfMemoryError

2025-05-07T20:32:28.7010849Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[... OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(...)): tried to allocate 112.00 MiB with 32.44 MiB free ...]

2025-05-07T20:32:28.7017140Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
[... OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], ...)): tried to allocate 448.00 MiB with 144.44 MiB free ...]

2025-05-07T20:32:28.7023354Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[... OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(...)): tried to allocate 56.00 MiB with 32.44 MiB free ...]

2025-05-07T20:32:28.7030155Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
[... OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 56.00 MiB with 32.44 MiB free ...]
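Editor's note: the OutOfMemoryError cases are a size-and-fragmentation problem rather than a kernel bug. Each example allocates a [T, 2 * D] bfloat16 tensor plus several same-shape temporaries; for T=16384, D=7168 that is 16384 x 14336 x 2 bytes = 448.00 MiB per tensor, matching the allocation size reported above, and cached blocks from earlier examples fragment the 22 GiB pool. Two mitigations the error text itself points at, as a sketch (the env var must be set before the first CUDA allocation; placement of the cleanup, e.g. in tearDown, is hypothetical):

    import os
    # Opt in to expandable segments to reduce fragmentation, as the
    # error message suggests; must be set before CUDA is initialized.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cached_memory() -> None:
        # Finish pending kernels, then return cached blocks to the
        # allocator so the next example starts from a cleaner pool.
        torch.cuda.synchronize()
        torch.cuda.empty_cache()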
2025-05-07T20:32:28.7036400Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
[... same test body and Triton traceback; fails with CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") ...]

2025-05-07T20:32:28.7051125Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
[... same CompilationError ...]

2025-05-07T20:32:28.7065641Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
[... same CompilationError ...]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis then tried thirteen further examples. Each attempt re-printed the identical test body shown above, so only the drawn parameters and the outcome are kept here:

Trying example: T=2048,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=False -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)), tried to allocate 56.00 MiB
Trying example: T=1,     D=5120, scale_ub=1200.0, contiguous=True,  compiled=False -> triton.compiler.errors.CompilationError in _fbgemm_silu_mul_quant (activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: T=2048,  D=5120, scale_ub=None,   contiguous=True,  compiled=False -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)), tried to allocate 40.00 MiB
Trying example: T=16384, D=5120, scale_ub=None,   contiguous=True,  compiled=False -> torch.OutOfMemoryError at moe/activation_test.py:92, tried to allocate 320.00 MiB
Trying example: T=4096,  D=5120, scale_ub=None,   contiguous=True,  compiled=False -> torch.OutOfMemoryError at moe/activation_test.py:92, tried to allocate 80.00 MiB
Trying example: T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=False -> torch.OutOfMemoryError at moe/activation_test.py:92, tried to allocate 40.00 MiB
Trying example: T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=True  -> torch.OutOfMemoryError at moe/activation_test.py:92, tried to allocate 112.00 MiB
Trying example: T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False -> torch.OutOfMemoryError at moe/activation_test.py:92, tried to allocate 40.00 MiB
Trying example: T=4096,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=False -> torch.OutOfMemoryError at moe/activation_test.py:92, tried to allocate 112.00 MiB
Trying example: T=16384, D=7168, scale_ub=None,   contiguous=False, compiled=True  -> torch.OutOfMemoryError at moe/activation_test.py:92, tried to allocate 448.00 MiB
Trying example: T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=False -> torch.OutOfMemoryError at moe/activation_test.py:92, tried to allocate 112.00 MiB
Trying example: T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=False -> torch.OutOfMemoryError at moe/activation_test.py:92, tried to allocate 448.00 MiB
Trying example: T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=False -> torch.OutOfMemoryError at moe/activation_test.py:92, tried to allocate 448.00 MiB

Every OutOfMemoryError carried the same allocator report: GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free; including non-PyTorch memory, this process has 22.03 GiB in use, roughly 21.7 GiB of it allocated by PyTorch; the message suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation and points to https://pytorch.org/docs/stable/notes/cuda.html#environment-variables.
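Two things stand out in that allocator report: the GPU is already pinned at ~21.7 GiB before the failing allocation, and even the largest single request here is only 448 MiB, which points at memory accumulating across hypothesis examples rather than any one tensor being too large. A minimal sketch of the two mitigations the message itself suggests; where this cleanup would be wired in (e.g. a setUp/tearDown hook) is an assumption, not FBGEMM's actual test code:

import gc
import os

# The allocator only honors this if it is set before CUDA initializes, so in
# practice it belongs in the CI job environment rather than the test module
# (assumption: the workflow can export it before pytest launches).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch


def release_gpu_memory() -> None:
    # Drop dead tensors and return cached blocks to the driver between
    # hypothesis examples, so a 448 MiB request cannot fail on a 22 GiB A10G
    # that is merely fragmented rather than genuinely full.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
        torch.cuda.empty_cache()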
Trying example: T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False — identical test body elided; it reaches

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(remaining frames through triton/runtime/jit.py and triton/compiler/compiler.py identical to the first failure above)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
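The ValueError is architectural, not transient: Triton lowers torch.float8_e4m3fn to fp8e4nv, which requires compute capability 8.9 or newer (Ada/Hopper), while the A10G on this linux.g5.4xlarge runner reports (8, 6) and only offers fp8e4b15/fp8e5. A hedged sketch of a capability gate for tests like this one; the helper name and skip wiring are assumptions, not FBGEMM's actual fix:

import unittest

import torch


def cuda_supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv lowering needs SM 8.9+; an A10G reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Applied to the hypothesis test so unsupported runners skip instead of fail:
skip_unless_fp8 = unittest.skipUnless(
    cuda_supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9"
)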
Trying example: T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False — identical test body elided
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (same allocator report as above).

moe/activation_test.py:92: OutOfMemoryError

Trying example: T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True — identical test body elided; the compiled=True path adds a dynamo frame to the same failure:

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(remaining frames through triton/runtime/jit.py and triton/compiler/compiler.py identical to the first failure above)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
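Note that compiled=True changes nothing except the extra torch/_dynamo/eval_frame.py frame: torch.compile still ends up JIT-compiling the same Triton kernel, so both paths die in make_ir. A hedged sketch of reproducing the failure outside hypothesis, using the import path shown in the traceback; the shapes are the smallest failing example from the log, and on a pre-SM 8.9 GPU both calls should raise the same CompilationError:

import torch

# Import path taken from the traceback above (requires fbgemm_gpu with the
# experimental gen_ai extras installed).
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 1, 5120  # smallest failing example in the log; trivially fits in memory
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

# Eager launch and compiled launch reach the same Triton lowering, so on an
# A10G (SM 8.6) both raise the fp8e4nv ValueError during kernel compilation.
y_fp8, y_scale = silu_mul_quant(x0, x1, None)
y_fp8_c, y_scale_c = torch.compile(silu_mul_quant)(x0, x1, None)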
Trying example: T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False — identical test body elided
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 now has only 8.44 MiB of its 22.07 GiB free.

moe/activation_test.py:95: OutOfMemoryError

Trying example: T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True — identical test body elided
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB.

moe/activation_test.py:94: OutOfMemoryError

Trying example: T=128, D=7168, scale_ub=None, contiguous=True, compiled=True — identical test body elided
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB.

moe/activation_test.py:92: OutOfMemoryError
=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
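That repeated DeprecationWarning comes from passing benchmarking knobs to @triton.autotune; per triton-lang/triton#4496 timing moved into Triton's do_bench, so newer code simply omits them. A hedged sketch of the newer decorator style (the kernel and config values below are placeholders, not FBGEMM's tuning; the FutureWarning block of the summary continues after it):

import triton
import triton.language as tl

@triton.autotune(
    configs=[triton.Config({"BLOCK": 128}, num_warps=4)],
    key=["N"],
    # warmup=25, rep=100, use_cuda_graph=False,  # deprecated: drop these kwargs
)
@triton.jit
def _scale_kernel(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
    # Trivial placeholder body: y = 2 * x over a 1D range.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask) * 2.0, mask=mask)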
experimental/gen_ai/test/moe/activation_test.py: 10 warnings
  /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844.
    torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================== 1 failed, 1 passed, 13 warnings in 31.65s ===================
ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error)

[TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py

[EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py
============================= test session starts ==============================
platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
cachedir: .pytest_cache
hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
plugins: hypothesis-6.131.14
TMA benchmarks will be running with experimental grid constant TMA descriptor.
collecting ...
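Before the rerun's collection output, one aside on the FutureWarning above: torch.testing.assert_allclose has a direct replacement and the tolerances carry over unchanged. A one-line migration for the call the warning points at (function wrapper added here only to make the snippet self-contained):

import torch

def check_activation(y: torch.Tensor, y_ref: torch.Tensor) -> None:
    # Same tolerances as the deprecated assert_allclose call at
    # activation_test.py:72; assert_close is the documented replacement.
    torch.testing.assert_close(y, y_ref, rtol=1.6e-2, atol=1e-3)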
2025-05-07T20:32:34.6082530Z run-last-failure: rerun previous 1 failure
2025-05-07T20:32:36.9115778Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:36.9117022Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last):
2025-05-07T20:32:36.9118513Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:32:36.9120099Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:32:36.9121654Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:32:36.9123163Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:36.9124875Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:32:36.9126369Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:36.9127914Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:32:36.9129636Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     generator.visit(fn.parse())
2025-05-07T20:32:36.9131003Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:32:36.9132390Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     ret = super().visit(node)
2025-05-07T20:32:36.9133525Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit
2025-05-07T20:32:36.9134654Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     return visitor(node)
2025-05-07T20:32:36.9136000Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:32:36.9137412Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:32:36.9138795Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit
2025-05-07T20:32:36.9139937Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     self.visit(item)
2025-05-07T20:32:36.9141241Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:32:36.9142745Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:32:36.9143912Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:36.9144924Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant(
2025-05-07T20:32:36.9145734Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^
2025-05-07T20:32:36.9146859Z W0507 20:32:36.910000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
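The ValueError above is an architecture gate rather than a bug in the kernel source: Triton only lowers the fp8e4nv (float8_e4m3fn) type on sm_89-class GPUs and newer, and the A10G in a g5 instance reports compute capability (8, 6). A sketch of the kind of probe that could detect this up front; the helper name is illustrative, not an existing FBGEMM API:

import torch

def gpu_supports_fp8e4nv() -> bool:
    # Illustrative check: Triton's fp8e4nv lowering first appears on
    # sm_89 (Ada) / sm_90 (Hopper); the g5 runner's A10G is sm_86,
    # which is why only 'fp8e4b15' and 'fp8e5' are offered here.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)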
2025-05-07T20:32:37.5341552Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.5342978Z     self=<...>,
2025-05-07T20:32:37.5343465Z     T=1,
2025-05-07T20:32:37.5343686Z     D=5120,
2025-05-07T20:32:37.5343904Z     scale_ub=None,
2025-05-07T20:32:37.5344156Z     contiguous=True,
2025-05-07T20:32:37.5344430Z     compiled=True,
2025-05-07T20:32:37.5344664Z )
2025-05-07T20:32:37.5345036Z self = <...>
2025-05-07T20:32:37.5345602Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:37.5346002Z @given(
2025-05-07T20:32:37.5346268Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:37.5346634Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:37.5346989Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:37.5347368Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:37.5347751Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:37.5348083Z )
2025-05-07T20:32:37.5348492Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:37.5349010Z def test_silu_mul_quant(
2025-05-07T20:32:37.5349294Z     self,
2025-05-07T20:32:37.5349517Z     T: int,
2025-05-07T20:32:37.5349937Z     D: int,
2025-05-07T20:32:37.5350195Z     scale_ub: Optional[float],
2025-05-07T20:32:37.5350513Z     contiguous: bool,
2025-05-07T20:32:37.5350831Z     compiled: bool,
2025-05-07T20:32:37.5351155Z ) -> None:
2025-05-07T20:32:37.5351467Z     torch.manual_seed(2025)
2025-05-07T20:32:37.5352203Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:37.5352907Z     x_sign = torch.sign(x)
2025-05-07T20:32:37.5353245Z     x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:37.5353682Z     x = x_sign * x_clamp
2025-05-07T20:32:37.5353956Z     x0 = x[:, :D]
2025-05-07T20:32:37.5354211Z     x1 = x[:, D:]
2025-05-07T20:32:37.5354671Z     if contiguous:
2025-05-07T20:32:37.5354942Z         x0 = x0.contiguous()
2025-05-07T20:32:37.5355244Z         x1 = x1.contiguous()
2025-05-07T20:32:37.5355749Z     if scale_ub is not None:
2025-05-07T20:32:37.5356070Z         scale_ub_tensor = torch.tensor(
2025-05-07T20:32:37.5356464Z             [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:37.5356817Z         )
2025-05-07T20:32:37.5357043Z     else:
2025-05-07T20:32:37.5357289Z         scale_ub_tensor = None
2025-05-07T20:32:37.5357844Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:37.5358208Z         op = silu_mul_quant
2025-05-07T20:32:37.5358492Z         if compiled:
2025-05-07T20:32:37.5358783Z             op = torch.compile(op)
2025-05-07T20:32:37.5359129Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:37.5359671Z     y_fp8, y_scale = fn()
2025-05-07T20:32:37.5360003Z     y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:37.5360611Z     def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:37.5361003Z         x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:37.5361337Z         x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:37.5361700Z         y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:37.5362117Z         return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:37.5362709Z >   y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:37.5363058Z moe/activation_test.py:126:
2025-05-07T20:32:37.5363402Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:37.5363787Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:37.5364264Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:37.5365176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:37.5366044Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:37.5366666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:37.5367451Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:37.5368239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:37.5369060Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:37.5369919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:37.5370778Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:37.5371615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:37.5372340Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:37.5373318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:37.5373916Z     fn()
2025-05-07T20:32:37.5374494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:37.5375177Z     self.fn.run(
2025-05-07T20:32:37.5375714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:37.5376324Z     kernel = self.compile(
2025-05-07T20:32:37.5376937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:37.5377696Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:37.5378156Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:37.5378665Z self = <...>
2025-05-07T20:32:37.5379900Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:37.5381555Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
2025-05-07T20:32:37.5383098Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:37.5384271Z context = <...>
2025-05-07T20:32:37.5384793Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:37.5385401Z >   return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:37.5385941Z                        module_map=module_map)
2025-05-07T20:32:37.5386360Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.5386771Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:37.5387078Z E   ^
2025-05-07T20:32:37.5387610Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.5388597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
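Note that this failure is in the test's reference path: triton_quantize_fp8_row is itself a Triton kernel, so on this GPU the reference fails before the op under test is ever compared. A hypothetical pure-PyTorch row-wise quantizer (the names and the scale_ub clamping semantics are assumptions, not FBGEMM's actual implementation) would keep the reference off the Triton compiler entirely:

from typing import Optional, Tuple
import torch

FP8_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max

def torch_quantize_fp8_row(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row scale so each row's max magnitude maps to the fp8 max.
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        # Assumed semantics: cap the per-row max used for scaling.
        row_max = torch.minimum(row_max, scale_ub)
    scale = FP8_MAX / row_max
    # A plain dtype conversion does not require the sm_89 hardware
    # paths that fused fp8 Triton kernels do.
    y_fp8 = (y * scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, (1.0 / scale).squeeze(-1)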
2025-05-07T20:32:37.5389402Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.5389879Z     self=<...>,
2025-05-07T20:32:37.5390342Z     T=2048,
2025-05-07T20:32:37.5390551Z     D=5120,
2025-05-07T20:32:37.5390780Z     scale_ub=1200.0,
2025-05-07T20:32:37.5391040Z     contiguous=True,
2025-05-07T20:32:37.5391294Z     compiled=False,
2025-05-07T20:32:37.5391535Z )
2025-05-07T20:32:39.5728889Z self = <...>
2025-05-07T20:32:39.5729483Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:32:39.5742258Z >   y_fp8, y_scale = fn()
2025-05-07T20:32:39.5742548Z moe/activation_test.py:117:
2025-05-07T20:32:39.5742867Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:39.5743376Z moe/activation_test.py:115: in fn
2025-05-07T20:32:39.5743685Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:39.5744421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:39.5745160Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:39.5745737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:39.5746470Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:39.5747173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:39.5747749Z     kernel = self.compile(
2025-05-07T20:32:39.5748335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:39.5749044Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:39.5749461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:39.5749934Z self = <...>
2025-05-07T20:32:39.5751081Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:39.5752549Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
2025-05-07T20:32:39.5754067Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:39.5755162Z context = <...>
2025-05-07T20:32:39.5755649Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:39.5756201Z >   return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:39.5756692Z                        module_map=module_map)
2025-05-07T20:32:39.5757079Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:39.5757455Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:39.5757724Z E   ^
2025-05-07T20:32:39.5758220Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:39.5759228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
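Every remaining example fails the same way, so rather than letting hypothesis iterate through all of them, a guard of roughly this shape (illustrative; the suite does not currently carry this marker) could skip the test on pre-sm_89 hardware:

import pytest
import torch

def _fp8e4nv_available() -> bool:
    # Mirrors the Triton error above: fp8e4nv needs sm_89 or newer.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@pytest.mark.skipif(not _fp8e4nv_available(), reason="fp8e4nv requires sm_89+")
def test_silu_mul_quant_guarded() -> None:
    ...  # same body as test_silu_mul_quant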
2025-05-07T20:32:39.5759894Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:39.5760334Z     self=<...>,
2025-05-07T20:32:39.5760760Z     T=2048,
2025-05-07T20:32:39.5760964Z     D=5120,
2025-05-07T20:32:39.5761166Z     scale_ub=1200.0,
2025-05-07T20:32:39.5761405Z     contiguous=True,
2025-05-07T20:32:39.5761645Z     compiled=True,
2025-05-07T20:32:39.5761867Z )
2025-05-07T20:32:39.5762201Z self = <...>
2025-05-07T20:32:39.5762727Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:39.5785478Z >   y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:39.5785809Z moe/activation_test.py:126:
2025-05-07T20:32:39.5807577Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:39.5807959Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:39.5808249Z E   ^
2025-05-07T20:32:39.5808838Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:39.5809757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:39.5810416Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:39.5810864Z     self=<...>,
2025-05-07T20:32:39.5811285Z     T=16384,
2025-05-07T20:32:39.5811517Z     D=7168,
2025-05-07T20:32:39.5811772Z     scale_ub=1200.0,
2025-05-07T20:32:39.5812014Z     contiguous=False,
2025-05-07T20:32:39.5812258Z     compiled=False,
2025-05-07T20:32:39.5812481Z )
2025-05-07T20:32:41.3299247Z self = <...>
2025-05-07T20:32:41.3299797Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:41.3312816Z >   y_fp8, y_scale = fn()
2025-05-07T20:32:41.3313096Z moe/activation_test.py:117:
2025-05-07T20:32:41.3327630Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:41.3327991Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:41.3328263Z E   ^
2025-05-07T20:32:41.3328754Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:41.3329656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.3329225Z 2025-05-07T20:32:41.3329656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.3330202Z 2025-05-07T20:32:41.3330312Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.3330748Z self=, 2025-05-07T20:32:41.3331172Z T=1, 2025-05-07T20:32:41.3331365Z D=7168, 2025-05-07T20:32:41.3331571Z scale_ub=None, 2025-05-07T20:32:41.3331799Z contiguous=True, 2025-05-07T20:32:41.3332038Z compiled=True, 2025-05-07T20:32:41.3332296Z ) 2025-05-07T20:32:41.3332642Z self = 2025-05-07T20:32:41.3333142Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:41.3333416Z 2025-05-07T20:32:41.3333623Z @given( 2025-05-07T20:32:41.3333869Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.3334200Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.3334519Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.3334867Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.3335212Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.3335514Z ) 2025-05-07T20:32:41.3335885Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.3336348Z def test_silu_mul_quant( 2025-05-07T20:32:41.3336601Z self, 2025-05-07T20:32:41.3336810Z T: int, 2025-05-07T20:32:41.3337023Z D: int, 2025-05-07T20:32:41.3337256Z scale_ub: Optional[float], 2025-05-07T20:32:41.3337543Z contiguous: bool, 2025-05-07T20:32:41.3337797Z compiled: bool, 2025-05-07T20:32:41.3338028Z ) -> None: 2025-05-07T20:32:41.3338259Z torch.manual_seed(2025) 2025-05-07T20:32:41.3338522Z 2025-05-07T20:32:41.3338802Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.3339165Z 2025-05-07T20:32:41.3339379Z x_sign = torch.sign(x) 2025-05-07T20:32:41.3339689Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.3340007Z x = x_sign * x_clamp 2025-05-07T20:32:41.3340262Z x0 = x[:, :D] 2025-05-07T20:32:41.3340496Z x1 = x[:, D:] 2025-05-07T20:32:41.3340713Z 2025-05-07T20:32:41.3340919Z if contiguous: 2025-05-07T20:32:41.3341166Z x0 = x0.contiguous() 2025-05-07T20:32:41.3341435Z x1 = x1.contiguous() 2025-05-07T20:32:41.3341693Z 2025-05-07T20:32:41.3341901Z if scale_ub is not None: 2025-05-07T20:32:41.3342192Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.3342547Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.3342875Z ) 2025-05-07T20:32:41.3343077Z else: 2025-05-07T20:32:41.3343309Z scale_ub_tensor = None 2025-05-07T20:32:41.3343578Z 2025-05-07T20:32:41.3343817Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.3344151Z op = silu_mul_quant 2025-05-07T20:32:41.3344420Z if compiled: 2025-05-07T20:32:41.3344686Z op = torch.compile(op) 2025-05-07T20:32:41.3344998Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.3345288Z 2025-05-07T20:32:41.3345495Z y_fp8, y_scale = fn() 2025-05-07T20:32:41.3345815Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:41.3346120Z 2025-05-07T20:32:41.3346373Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.3346806Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:41.3347120Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:41.3347450Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:41.3347830Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:41.3348160Z 2025-05-07T20:32:41.3348378Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:41.3348583Z 2025-05-07T20:32:41.3348695Z moe/activation_test.py:126: 2025-05-07T20:32:41.3349004Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.3349357Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:41.3349703Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:41.3350536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:41.3351318Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:41.3351895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.3352610Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.3353410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:41.3354241Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:41.3355030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:41.3355813Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:41.3356567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:41.3357240Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:41.3357875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:41.3358417Z fn() 2025-05-07T20:32:41.3358947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:41.3359566Z self.fn.run( 2025-05-07T20:32:41.3360059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.3360611Z kernel = self.compile( 2025-05-07T20:32:41.3361183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.3361870Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.3362288Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.3362530Z 2025-05-07T20:32:41.3362748Z self = 2025-05-07T20:32:41.3363880Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.3365315Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7eff09ea5ab0>} 2025-05-07T20:32:41.3366722Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.3367796Z context = 2025-05-07T20:32:41.3368099Z 2025-05-07T20:32:41.3368276Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.3368995Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.3369494Z module_map=module_map) 2025-05-07T20:32:41.3369875Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.3370255Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:41.3370538Z E ^ 2025-05-07T20:32:41.3371021Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.3371500Z 2025-05-07T20:32:41.3371935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.3372529Z 2025-05-07T20:32:41.3372639Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.3373075Z self=, 2025-05-07T20:32:41.3373490Z T=4096, 2025-05-07T20:32:41.3373690Z D=5120, 2025-05-07T20:32:41.3373901Z scale_ub=None, 2025-05-07T20:32:41.3374127Z contiguous=False, 2025-05-07T20:32:41.3374374Z compiled=False, 2025-05-07T20:32:41.3374592Z ) 2025-05-07T20:32:41.9428501Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:41.9429820Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:32:41.9431232Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:41.9432775Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:41.9434277Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:41.9435733Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.9437099Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:41.9438532Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.9440013Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:41.9441305Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 
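Note on the failure mode repeated throughout this run: every example dies at Triton IR generation with ValueError("type fp8e4nv not supported in this architecture"). fp8e4nv is Triton's name for the float8 E4M3 format, which NVIDIA supports natively only on compute capability 8.9 and newer (Ada/Hopper); on older parts Triton exposes just fp8e4b15 and fp8e5, exactly the pair listed in the error, which suggests this runner's GPU is pre-Ada (for example, an A10G at SM 8.6). The sketch below is a minimal, hypothetical guard a test module could use to skip E4M3 cases on such GPUs; the helper name and skip message are illustrative and are not part of moe/activation_test.py:

import unittest

import torch

def _supports_fp8e4nv() -> bool:
    # Hypothetical helper: Triton's fp8e4nv (E4M3) requires an NVIDIA GPU with
    # compute capability >= 8.9; anything older raises the CompilationError
    # seen above, offering only fp8e4b15 and fp8e5.
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)

# Illustrative use on a test method:
# @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv needs SM 8.9+ (Ada/Hopper)")
# def test_silu_mul_quant(self, ...) -> None: ...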
2025-05-07T20:32:41.9442579Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:41.9443840Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:32:41.9444923Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:41.9446110Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 2025-05-07T20:32:41.9447379Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:41.9448722Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:41.9449887Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:41.9450974Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:32:41.9452202Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:41.9453616Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:41.9454802Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.9455757Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.9456536Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:32:41.9457601Z W0507 20:32:41.939000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5693834Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:42.5695089Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:32:42.5696610Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:42.5698214Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:42.5699779Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:42.5701347Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5702826Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:42.5704373Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5706148Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:42.5707544Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 2025-05-07T20:32:42.5708924Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:42.5710286Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:32:42.5711454Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:42.5712603Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 2025-05-07T20:32:42.5714055Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:42.5715633Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:42.5716896Z W0507 
20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:42.5718070Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:32:42.5719398Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:42.5720923Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:42.5722126Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5723161Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5724185Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:32:42.5725338Z W0507 20:32:42.565000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.8314670Z self = 2025-05-07T20:32:43.8315366Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.8315809Z 2025-05-07T20:32:43.8315961Z @given( 2025-05-07T20:32:43.8316335Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.8316795Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.8317149Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.8317546Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.8317932Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.8324767Z ) 2025-05-07T20:32:43.8325187Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.8325705Z def test_silu_mul_quant( 2025-05-07T20:32:43.8325992Z self, 2025-05-07T20:32:43.8326215Z T: int, 2025-05-07T20:32:43.8326450Z D: int, 2025-05-07T20:32:43.8326902Z scale_ub: Optional[float], 2025-05-07T20:32:43.8327217Z contiguous: bool, 2025-05-07T20:32:43.8327500Z compiled: bool, 2025-05-07T20:32:43.8327766Z ) -> None: 2025-05-07T20:32:43.8328014Z torch.manual_seed(2025) 2025-05-07T20:32:43.8328305Z 2025-05-07T20:32:43.8328624Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.8329011Z 2025-05-07T20:32:43.8329239Z x_sign = torch.sign(x) 2025-05-07T20:32:43.8329576Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.8329939Z x = x_sign * x_clamp 2025-05-07T20:32:43.8330212Z x0 = x[:, :D] 2025-05-07T20:32:43.8330470Z x1 = x[:, D:] 2025-05-07T20:32:43.8330711Z 2025-05-07T20:32:43.8330921Z if contiguous: 2025-05-07T20:32:43.8331189Z x0 = x0.contiguous() 2025-05-07T20:32:43.8331492Z x1 = x1.contiguous() 2025-05-07T20:32:43.8331770Z 2025-05-07T20:32:43.8331997Z if scale_ub is not None: 2025-05-07T20:32:43.8332327Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.8332710Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.8333073Z ) 2025-05-07T20:32:43.8333327Z else: 2025-05-07T20:32:43.8333699Z scale_ub_tensor = None 
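# (annotation, not in the logged source) scale_ub, when drawn as 1200.0, reaches
# the kernel as a one-element float32 CUDA tensor; with scale_ub=None the
# per-row quantization scale is presumably left unclamped.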
2025-05-07T20:32:43.8333993Z 2025-05-07T20:32:43.8334263Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.8334617Z op = silu_mul_quant 2025-05-07T20:32:43.8334904Z if compiled: 2025-05-07T20:32:43.8335191Z op = torch.compile(op) 2025-05-07T20:32:43.8335527Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8335844Z 2025-05-07T20:32:43.8336068Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.8336260Z 2025-05-07T20:32:43.8336379Z moe/activation_test.py:117: 2025-05-07T20:32:43.8336715Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8337105Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.8337432Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8338216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.8339009Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.8339619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.8340393Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.8341144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.8341748Z kernel = self.compile( 2025-05-07T20:32:43.8342358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.8343140Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.8343643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8343908Z 2025-05-07T20:32:43.8344153Z self = 2025-05-07T20:32:43.8345389Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.8346956Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff09ea6d40>} 2025-05-07T20:32:43.8348488Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.8349753Z context = 2025-05-07T20:32:43.8350083Z 2025-05-07T20:32:43.8350273Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.8350869Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.8351411Z module_map=module_map) 2025-05-07T20:32:43.8351828Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.8352227Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.8352525Z E ^ 2025-05-07T20:32:43.8353055Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.8353667Z 2025-05-07T20:32:43.8354146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.8354728Z 2025-05-07T20:32:43.8354847Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.8355325Z self=, 2025-05-07T20:32:43.8355788Z T=4096, 2025-05-07T20:32:43.8356008Z D=7168, 2025-05-07T20:32:43.8356232Z scale_ub=None, 2025-05-07T20:32:43.8356483Z contiguous=False, 2025-05-07T20:32:43.8356836Z compiled=False, 2025-05-07T20:32:43.8357071Z ) 2025-05-07T20:32:43.8357436Z self = 2025-05-07T20:32:43.8357996Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.8358312Z 2025-05-07T20:32:43.8358401Z @given( 2025-05-07T20:32:43.8358666Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.8359025Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.8359370Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.8359751Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.8360129Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.8360453Z ) 2025-05-07T20:32:43.8360863Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.8361369Z def test_silu_mul_quant( 2025-05-07T20:32:43.8361645Z self, 2025-05-07T20:32:43.8361876Z T: int, 2025-05-07T20:32:43.8362102Z D: int, 2025-05-07T20:32:43.8362350Z scale_ub: Optional[float], 2025-05-07T20:32:43.8362670Z contiguous: bool, 2025-05-07T20:32:43.8362948Z compiled: bool, 2025-05-07T20:32:43.8363205Z ) -> None: 2025-05-07T20:32:43.8363446Z torch.manual_seed(2025) 2025-05-07T20:32:43.8363723Z 2025-05-07T20:32:43.8364035Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.8364420Z 2025-05-07T20:32:43.8364667Z x_sign = torch.sign(x) 2025-05-07T20:32:43.8365002Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.8365353Z x = x_sign * x_clamp 2025-05-07T20:32:43.8365629Z x0 = x[:, :D] 2025-05-07T20:32:43.8365881Z x1 = x[:, D:] 2025-05-07T20:32:43.8366122Z 2025-05-07T20:32:43.8366334Z if contiguous: 2025-05-07T20:32:43.8366605Z x0 = x0.contiguous() 2025-05-07T20:32:43.8366902Z x1 = x1.contiguous() 2025-05-07T20:32:43.8367180Z 2025-05-07T20:32:43.8367403Z if scale_ub is not None: 2025-05-07T20:32:43.8367716Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.8368100Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.8368455Z ) 2025-05-07T20:32:43.8368680Z else: 2025-05-07T20:32:43.8368917Z scale_ub_tensor = None 2025-05-07T20:32:43.8369206Z 2025-05-07T20:32:43.8369479Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.8369838Z op = silu_mul_quant 2025-05-07T20:32:43.8370129Z if compiled: 2025-05-07T20:32:43.8370415Z op = torch.compile(op) 2025-05-07T20:32:43.8370747Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8371163Z 2025-05-07T20:32:43.8371396Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.8371588Z 2025-05-07T20:32:43.8371710Z moe/activation_test.py:117: 2025-05-07T20:32:43.8372042Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8372430Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.8372761Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8373545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.8374334Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.8374945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.8375724Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.8376480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.8377088Z kernel = self.compile( 2025-05-07T20:32:43.8377706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.8378538Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.8378994Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8379261Z 2025-05-07T20:32:43.8379495Z self = 2025-05-07T20:32:43.8380726Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.8382284Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff09ea7c70>} 2025-05-07T20:32:43.8383861Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.8385037Z context = 2025-05-07T20:32:43.8385366Z 2025-05-07T20:32:43.8385564Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.8386167Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.8386697Z module_map=module_map) 2025-05-07T20:32:43.8387114Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.8387516Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.8387812Z E ^ 2025-05-07T20:32:43.8388344Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.8388857Z 2025-05-07T20:32:43.8389333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.8389914Z 2025-05-07T20:32:43.8390044Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.8390512Z self=, 2025-05-07T20:32:43.8390967Z T=128, 2025-05-07T20:32:43.8391183Z D=7168, 2025-05-07T20:32:43.8391400Z scale_ub=None, 2025-05-07T20:32:43.8391651Z contiguous=False, 2025-05-07T20:32:43.8391913Z compiled=True, 2025-05-07T20:32:43.8392139Z ) 2025-05-07T20:32:43.9072761Z self = 2025-05-07T20:32:43.9073577Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.9073886Z 2025-05-07T20:32:43.9073974Z @given( 2025-05-07T20:32:43.9074233Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.9074759Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.9075111Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.9075480Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.9075852Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.9076173Z ) 2025-05-07T20:32:43.9076568Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.9077055Z def test_silu_mul_quant( 2025-05-07T20:32:43.9077326Z self, 2025-05-07T20:32:43.9077544Z T: int, 2025-05-07T20:32:43.9077763Z D: int, 2025-05-07T20:32:43.9078010Z scale_ub: Optional[float], 2025-05-07T20:32:43.9078313Z contiguous: bool, 2025-05-07T20:32:43.9078581Z compiled: bool, 2025-05-07T20:32:43.9078827Z ) -> None: 2025-05-07T20:32:43.9079069Z torch.manual_seed(2025) 2025-05-07T20:32:43.9079337Z 2025-05-07T20:32:43.9079646Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.9080023Z 2025-05-07T20:32:43.9080242Z x_sign = torch.sign(x) 2025-05-07T20:32:43.9080561Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.9081068Z x = x_sign * x_clamp 2025-05-07T20:32:43.9081339Z x0 = x[:, :D] 2025-05-07T20:32:43.9081576Z x1 = x[:, D:] 2025-05-07T20:32:43.9081810Z 2025-05-07T20:32:43.9082021Z if contiguous: 2025-05-07T20:32:43.9082277Z x0 = x0.contiguous() 2025-05-07T20:32:43.9082561Z x1 = x1.contiguous() 2025-05-07T20:32:43.9082827Z 2025-05-07T20:32:43.9083036Z if scale_ub is not None: 2025-05-07T20:32:43.9083341Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.9083715Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.9084063Z ) 2025-05-07T20:32:43.9084276Z else: 2025-05-07T20:32:43.9084513Z scale_ub_tensor = None 2025-05-07T20:32:43.9084793Z 2025-05-07T20:32:43.9085064Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.9085417Z op = silu_mul_quant 2025-05-07T20:32:43.9085701Z if compiled: 2025-05-07T20:32:43.9085991Z op = torch.compile(op) 2025-05-07T20:32:43.9086325Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.9086635Z 2025-05-07T20:32:43.9086851Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.9087176Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.9087504Z 2025-05-07T20:32:43.9087768Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.9088147Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.9088478Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.9088822Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.9089224Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.9089578Z 2025-05-07T20:32:43.9089811Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.9090031Z 2025-05-07T20:32:43.9090142Z moe/activation_test.py:126: 2025-05-07T20:32:43.9090476Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.9090858Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.9091224Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.9092098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.9092961Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.9093596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.9094348Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.9095204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.9096012Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.9096845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:43.9097687Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.9098495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.9099208Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.9099871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.9100448Z fn() 2025-05-07T20:32:43.9101012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.9101657Z self.fn.run( 2025-05-07T20:32:43.9102178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.9102767Z kernel = self.compile( 2025-05-07T20:32:43.9103370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.9104180Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.9104623Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.9104883Z 2025-05-07T20:32:43.9105115Z self = 2025-05-07T20:32:43.9106313Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.9107832Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7eff09ea7ac0>} 2025-05-07T20:32:43.9109316Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.9110459Z context = 2025-05-07T20:32:43.9110781Z 2025-05-07T20:32:43.9110978Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.9111560Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.9112080Z module_map=module_map) 2025-05-07T20:32:43.9112497Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.9112897Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.9113190Z E ^ 2025-05-07T20:32:43.9113765Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.9114262Z 2025-05-07T20:32:43.9114726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.9115297Z 2025-05-07T20:32:43.9115420Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9115877Z self=, 2025-05-07T20:32:43.9116323Z T=128, 2025-05-07T20:32:43.9116539Z D=7168, 2025-05-07T20:32:43.9116756Z scale_ub=None, 2025-05-07T20:32:43.9117002Z contiguous=False, 2025-05-07T20:32:43.9117262Z compiled=False, 2025-05-07T20:32:43.9117489Z ) 2025-05-07T20:32:44.2970707Z self = 2025-05-07T20:32:44.2971363Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.2971803Z 2025-05-07T20:32:44.2972135Z @given( 2025-05-07T20:32:44.2972511Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2973007Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2973462Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2973966Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2974368Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2974698Z ) 2025-05-07T20:32:44.2975095Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2975602Z def test_silu_mul_quant( 2025-05-07T20:32:44.2975881Z self, 2025-05-07T20:32:44.2976101Z T: int, 2025-05-07T20:32:44.2976362Z D: int, 2025-05-07T20:32:44.2976613Z scale_ub: Optional[float], 2025-05-07T20:32:44.2976924Z contiguous: bool, 2025-05-07T20:32:44.2977203Z compiled: bool, 2025-05-07T20:32:44.2977461Z ) -> None: 2025-05-07T20:32:44.2977717Z torch.manual_seed(2025) 2025-05-07T20:32:44.2977992Z 2025-05-07T20:32:44.2978296Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2978683Z 2025-05-07T20:32:44.2978909Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2979390Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2979742Z x = x_sign * x_clamp 2025-05-07T20:32:44.2980018Z x0 = x[:, :D] 2025-05-07T20:32:44.2980262Z x1 = x[:, D:] 2025-05-07T20:32:44.2980499Z 2025-05-07T20:32:44.2980717Z if contiguous: 2025-05-07T20:32:44.2980979Z x0 = x0.contiguous() 2025-05-07T20:32:44.2981274Z x1 = x1.contiguous() 2025-05-07T20:32:44.2981554Z 2025-05-07T20:32:44.2981773Z if scale_ub is not None: 2025-05-07T20:32:44.2982087Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2982471Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2982823Z ) 2025-05-07T20:32:44.2983047Z else: 2025-05-07T20:32:44.2983295Z scale_ub_tensor = None 2025-05-07T20:32:44.2983587Z 2025-05-07T20:32:44.2983850Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2984218Z op = silu_mul_quant 2025-05-07T20:32:44.2984509Z if compiled: 
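# (annotation, not in the logged source) With compiled=True the op is first
# wrapped by torch.compile, so the Triton launch is traced through dynamo's
# triton_kernel_wrap; that is why the same fp8e4nv ValueError also surfaces
# earlier in this log as W0507 "identify_mutated_tensors" warnings.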
2025-05-07T20:32:44.2984790Z op = torch.compile(op) 2025-05-07T20:32:44.2985130Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2985449Z 2025-05-07T20:32:44.2985672Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.2985868Z 2025-05-07T20:32:44.2985982Z moe/activation_test.py:117: 2025-05-07T20:32:44.2986322Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2986692Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.2987021Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2987809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.2988589Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.2989191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2989973Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2990723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2991597Z kernel = self.compile( 2025-05-07T20:32:44.2992368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2993115Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2993617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2993876Z 2025-05-07T20:32:44.2994110Z self = 2025-05-07T20:32:44.2995779Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2997325Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff1a0cbb50>} 2025-05-07T20:32:44.2998833Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2999981Z context = 2025-05-07T20:32:44.3000304Z 2025-05-07T20:32:44.3000494Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3001086Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3001617Z module_map=module_map) 2025-05-07T20:32:44.3002031Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3002520Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3002820Z E ^ 2025-05-07T20:32:44.3003348Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3003853Z 2025-05-07T20:32:44.3004320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3004899Z 2025-05-07T20:32:44.3005018Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3005487Z self=, 2025-05-07T20:32:44.3005944Z T=4096, 2025-05-07T20:32:44.3006158Z D=5120, 2025-05-07T20:32:44.3006387Z scale_ub=1200.0, 2025-05-07T20:32:44.3006650Z contiguous=True, 2025-05-07T20:32:44.3006903Z compiled=False, 2025-05-07T20:32:44.3007147Z ) 2025-05-07T20:32:44.3007514Z self = 2025-05-07T20:32:44.3008075Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3008394Z 2025-05-07T20:32:44.3008483Z @given( 2025-05-07T20:32:44.3008754Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3009107Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3009461Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3009842Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3010219Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3010541Z ) 2025-05-07T20:32:44.3010945Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3011448Z def test_silu_mul_quant( 2025-05-07T20:32:44.3011723Z self, 2025-05-07T20:32:44.3011962Z T: int, 2025-05-07T20:32:44.3012196Z D: int, 2025-05-07T20:32:44.3012447Z scale_ub: Optional[float], 2025-05-07T20:32:44.3012823Z contiguous: bool, 2025-05-07T20:32:44.3013179Z compiled: bool, 2025-05-07T20:32:44.3013498Z ) -> None: 2025-05-07T20:32:44.3013811Z torch.manual_seed(2025) 2025-05-07T20:32:44.3014160Z 2025-05-07T20:32:44.3014548Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3015035Z 2025-05-07T20:32:44.3015316Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3015646Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3016000Z x = x_sign * x_clamp 2025-05-07T20:32:44.3016279Z x0 = x[:, :D] 2025-05-07T20:32:44.3016532Z x1 = x[:, D:] 2025-05-07T20:32:44.3016770Z 2025-05-07T20:32:44.3016988Z if contiguous: 2025-05-07T20:32:44.3017266Z x0 = x0.contiguous() 2025-05-07T20:32:44.3017683Z x1 = x1.contiguous() 2025-05-07T20:32:44.3017964Z 2025-05-07T20:32:44.3018189Z if scale_ub is not None: 2025-05-07T20:32:44.3018500Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3018887Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3019249Z ) 2025-05-07T20:32:44.3025990Z else: 2025-05-07T20:32:44.3026275Z scale_ub_tensor = None 2025-05-07T20:32:44.3026576Z 2025-05-07T20:32:44.3026853Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3027214Z op = silu_mul_quant 2025-05-07T20:32:44.3027511Z if compiled: 2025-05-07T20:32:44.3027798Z op = torch.compile(op) 2025-05-07T20:32:44.3028136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3028450Z 2025-05-07T20:32:44.3028679Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3028869Z 2025-05-07T20:32:44.3028985Z moe/activation_test.py:117: 2025-05-07T20:32:44.3029334Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3029719Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3030046Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3031008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3031792Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3032401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3033167Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3033988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3034596Z kernel = self.compile( 2025-05-07T20:32:44.3035212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3035943Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3036393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3036670Z 2025-05-07T20:32:44.3036907Z self = 2025-05-07T20:32:44.3038119Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3039657Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff0994a0e0>} 2025-05-07T20:32:44.3041173Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3042329Z context = 2025-05-07T20:32:44.3042657Z 2025-05-07T20:32:44.3042852Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3043525Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3044066Z module_map=module_map) 2025-05-07T20:32:44.3044490Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3044896Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3045188Z E ^ 2025-05-07T20:32:44.3045720Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3046232Z 2025-05-07T20:32:44.3046706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3047286Z 2025-05-07T20:32:44.3047548Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3048021Z self=, 2025-05-07T20:32:44.3048482Z T=1, 2025-05-07T20:32:44.3048701Z D=5120, 2025-05-07T20:32:44.3048919Z scale_ub=None, 2025-05-07T20:32:44.3049169Z contiguous=True, 2025-05-07T20:32:44.3049432Z compiled=True, 2025-05-07T20:32:44.3049664Z ) 2025-05-07T20:32:44.8011030Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:44.8012629Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:44.8014176Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:44.8015798Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:44.8017555Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:44.8019133Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8020626Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:44.8022207Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8024061Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:44.8025465Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:44.8026840Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:44.8028207Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:44.8029378Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:44.8030529Z W0507 20:32:44.797000 87500 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:44.8031907Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:44.8033355Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:44.8034850Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:44.8036033Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:44.8037366Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:44.8038896Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:44.8040091Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8041123Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.8041962Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:44.8043113Z W0507 20:32:44.797000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9778874Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:44.9780326Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:44.9781831Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:44.9783420Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:44.9784972Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:44.9786520Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9787977Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:44.9789518Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9791103Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:44.9792495Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:44.9793928Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:44.9795286Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:44.9796630Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:44.9797784Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:44.9799151Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:44.9800590Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:44.9801851Z W0507 
20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:44.9803023Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:44.9804342Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:44.9806014Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:44.9807213Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9808238Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9809081Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:44.9810225Z W0507 20:32:44.974000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:45.4532194Z self = 2025-05-07T20:32:45.4532871Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:45.4533338Z 2025-05-07T20:32:45.4533467Z @given( 2025-05-07T20:32:45.4533833Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:45.4534311Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:45.4534795Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:45.4535291Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:45.4535774Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:45.4536119Z ) 2025-05-07T20:32:45.4536525Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:45.4537025Z def test_silu_mul_quant( 2025-05-07T20:32:45.4537298Z self, 2025-05-07T20:32:45.4537531Z T: int, 2025-05-07T20:32:45.4537758Z D: int, 2025-05-07T20:32:45.4538006Z scale_ub: Optional[float], 2025-05-07T20:32:45.4538320Z contiguous: bool, 2025-05-07T20:32:45.4538595Z compiled: bool, 2025-05-07T20:32:45.4538847Z ) -> None: 2025-05-07T20:32:45.4539096Z torch.manual_seed(2025) 2025-05-07T20:32:45.4539372Z 2025-05-07T20:32:45.4539676Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:45.4540064Z 2025-05-07T20:32:45.4540285Z x_sign = torch.sign(x) 2025-05-07T20:32:45.4540617Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:45.4540963Z x = x_sign * x_clamp 2025-05-07T20:32:45.4541238Z x0 = x[:, :D] 2025-05-07T20:32:45.4541670Z x1 = x[:, D:] 2025-05-07T20:32:45.4541910Z 2025-05-07T20:32:45.4542126Z if contiguous: 2025-05-07T20:32:45.4542395Z x0 = x0.contiguous() 2025-05-07T20:32:45.4542689Z x1 = x1.contiguous() 2025-05-07T20:32:45.4542968Z 2025-05-07T20:32:45.4543196Z if scale_ub is not None: 2025-05-07T20:32:45.4543530Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:45.4543938Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:45.4544293Z ) 2025-05-07T20:32:45.4544510Z else: 2025-05-07T20:32:45.4544757Z scale_ub_tensor = None 
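# (annotation, not in the logged source) ref_fn below recomputes the activation
# in fp32 as y = x0 * sigmoid(x0) * x1, i.e. silu(x0) * x1, then row-quantizes
# it with triton_quantize_fp8_row; that reference kernel trips over the same
# fp8e4nv limit as the fused _fbgemm_silu_mul_quant.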
2025-05-07T20:32:45.4545047Z 2025-05-07T20:32:45.4545311Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.4545671Z op = silu_mul_quant 2025-05-07T20:32:45.4545958Z if compiled: 2025-05-07T20:32:45.4546244Z op = torch.compile(op) 2025-05-07T20:32:45.4546585Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:45.4546912Z 2025-05-07T20:32:45.4547139Z y_fp8, y_scale = fn() 2025-05-07T20:32:45.4547459Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:45.4547786Z 2025-05-07T20:32:45.4548202Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.4548572Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:45.4548907Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:45.4549267Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:45.4549665Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.4550017Z 2025-05-07T20:32:45.4550249Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:45.4550470Z 2025-05-07T20:32:45.4550591Z moe/activation_test.py:126: 2025-05-07T20:32:45.4550928Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.4551312Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:45.4551693Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.4552574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:45.4553450Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:45.4554206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:45.4554970Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:45.4555735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:45.4556545Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:45.4557390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:45.4558236Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:45.4559045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:45.4559769Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:45.4560446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:45.4561024Z fn() 2025-05-07T20:32:45.4561593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:45.4562244Z self.fn.run( 2025-05-07T20:32:45.4562771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:45.4563381Z kernel = self.compile( 2025-05-07T20:32:45.4564051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:45.4564880Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:45.4565332Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.4565603Z 2025-05-07T20:32:45.4565838Z self = 2025-05-07T20:32:45.4567054Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:45.4568594Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff09f20ee0>} 2025-05-07T20:32:45.4570109Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:45.4571259Z context = 2025-05-07T20:32:45.4571591Z 2025-05-07T20:32:45.4571783Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:45.4572533Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:45.4573075Z module_map=module_map) 2025-05-07T20:32:45.4573497Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:45.4573907Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:45.4574207Z E ^ 2025-05-07T20:32:45.4574737Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:45.4575254Z 2025-05-07T20:32:45.4575732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:45.4576310Z 2025-05-07T20:32:45.4576444Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:45.4576914Z self=, 2025-05-07T20:32:45.4577375Z T=2048, 2025-05-07T20:32:45.4577609Z D=5120, 2025-05-07T20:32:45.4577831Z scale_ub=None, 2025-05-07T20:32:45.4578083Z contiguous=True, 2025-05-07T20:32:45.4578343Z compiled=True, 2025-05-07T20:32:45.4578574Z ) 2025-05-07T20:32:45.9083737Z W0507 20:32:45.904000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:45.9085455Z W0507 20:32:45.904000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:32:45.9087004Z W0507 20:32:45.904000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:45.9088631Z W0507 20:32:45.904000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:45.9090204Z W0507 20:32:45.904000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:45.9091782Z W0507 20:32:45.904000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:45.9093269Z W0507 20:32:45.904000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:45.9095015Z W0507 20:32:45.904000 87500 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:45.9096630Z W0507 20:32:45.904000 87500 
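For readers tracing the failure above: the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], which pins down the contract of triton_quantize_fp8_row: it returns a row-wise FP8 tensor plus one dequantization scale per row. A minimal pure-PyTorch sketch of that contract follows; it is not FBGEMM's kernel, and the zero-row epsilon and the placement of the scale_ub clamp are assumptions:

    from typing import Optional, Tuple
    import torch

    def rowwise_quantize_fp8_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max magnitude sets the scale, optionally capped by scale_ub.
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
        row_max = torch.clamp(row_max, min=1e-12)  # assumed guard for all-zero rows
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # e4m3 is Triton's fp8e4nv
        scale = row_max / fp8_max  # dequantization scale, one entry per row
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale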
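The repeated ValueError is the root cause of every failure in this section: Triton only exposes the fp8e4nv (e4m3) dtype on NVIDIA GPUs with compute capability 8.9 or newer, and the GPU in this job reports an older architecture, which is why only 'fp8e4b15' and 'fp8e5' are offered. A minimal sketch of a skip guard such a test could use (the helper name is illustrative, not part of the test file):

    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (torch.float8_e4m3fn) needs sm_89+ (Ada / Hopper).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Inside a unittest-style test method:
    #     if not _supports_fp8e4nv():
    #         self.skipTest("fp8e4nv requires compute capability >= 8.9")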
2025-05-07T20:32:46.5581386Z self =
2025-05-07T20:32:46.5582062Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test source echo and CompilationError traceback identical to the first failure above omitted ...]
2025-05-07T20:32:46.5625957Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:46.5626431Z     self=,
2025-05-07T20:32:46.5626878Z     T=128,
2025-05-07T20:32:46.5627100Z     D=5120,
2025-05-07T20:32:46.5627327Z     scale_ub=None,
2025-05-07T20:32:46.5627573Z     contiguous=True,
2025-05-07T20:32:46.5627831Z     compiled=True,
2025-05-07T20:32:46.5628067Z )
[... two identical identify_mutated_tensors warnings ([0/6]) omitted ...]
2025-05-07T20:32:47.9945529Z self =
2025-05-07T20:32:47.9947094Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test source echo and CompilationError traceback identical to the first failure above omitted ...]
2025-05-07T20:32:47.9990761Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:47.9991199Z     self=,
2025-05-07T20:32:47.9991620Z     T=4096,
2025-05-07T20:32:47.9991828Z     D=5120,
2025-05-07T20:32:47.9992029Z     scale_ub=None,
2025-05-07T20:32:47.9992259Z     contiguous=True,
2025-05-07T20:32:47.9992496Z     compiled=True,
2025-05-07T20:32:47.9992708Z )
[... two identical identify_mutated_tensors warnings ([0/7]) omitted ...]
2025-05-07T20:32:49.2122886Z self =
2025-05-07T20:32:49.2123708Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test source echo and CompilationError traceback identical to the first failure above omitted ...]
2025-05-07T20:32:49.2164344Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:49.2164838Z     self=,
2025-05-07T20:32:49.2165261Z     T=16384,
2025-05-07T20:32:49.2165477Z     D=5120,
2025-05-07T20:32:49.2165688Z     scale_ub=None,
2025-05-07T20:32:49.2165914Z     contiguous=True,
2025-05-07T20:32:49.2166154Z     compiled=True,
2025-05-07T20:32:49.2166373Z )
2025-05-07T20:32:49.2558882Z W0507 20:32:49.254000 87500 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:49.2560499Z W0507 20:32:49.254000 87500 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:49.2561894Z W0507 20:32:49.254000 87500 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:49.2562922Z W0507 20:32:49.254000 87500 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:49.2564238Z W0507 20:32:49.254000 87500 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
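The recompile-limit warning above is separate from the FP8 failures: each new (T, contiguous) combination changes x0's shape or stride, torch.compile guards on those, and after 8 recompiles dynamo stops compiling this frame. Two conventional mitigations, sketched here under the assumption that sweeping shapes through one compiled function is intentional (not a recommendation for this test):

    import torch

    # (a) Mark the sweeping dimension dynamic so one graph covers all T values:
    x0 = torch.randn(16384, 5120, device="cuda", dtype=torch.bfloat16)
    torch._dynamo.mark_dynamic(x0, 0)  # dim 0 (= T) varies across examples

    # (b) Or raise the limit named in the warning above (default 8):
    torch._dynamo.config.recompile_limit = 64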
2025-05-07T20:32:49.3601865Z self = <...>
2025-05-07T20:32:49.3602626Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:49.3603048Z
2025-05-07T20:32:49.3603167Z     @given(
2025-05-07T20:32:49.3603545Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:49.3603912Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:49.3604255Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:49.3604957Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:49.3605666Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:49.3606269Z     )
2025-05-07T20:32:49.3607000Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:49.3607942Z     def test_silu_mul_quant(
2025-05-07T20:32:49.3608454Z         self,
2025-05-07T20:32:49.3608859Z         T: int,
2025-05-07T20:32:49.3609275Z         D: int,
2025-05-07T20:32:49.3609734Z         scale_ub: Optional[float],
2025-05-07T20:32:49.3610301Z         contiguous: bool,
2025-05-07T20:32:49.3610808Z         compiled: bool,
2025-05-07T20:32:49.3611283Z     ) -> None:
2025-05-07T20:32:49.3611727Z         torch.manual_seed(2025)
2025-05-07T20:32:49.3612232Z
2025-05-07T20:32:49.3612804Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:49.3613515Z
2025-05-07T20:32:49.3613921Z         x_sign = torch.sign(x)
2025-05-07T20:32:49.3614429Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:49.3614769Z         x = x_sign * x_clamp
2025-05-07T20:32:49.3615024Z         x0 = x[:, :D]
2025-05-07T20:32:49.3615260Z         x1 = x[:, D:]
2025-05-07T20:32:49.3615489Z
2025-05-07T20:32:49.3615684Z         if contiguous:
2025-05-07T20:32:49.3615939Z             x0 = x0.contiguous()
2025-05-07T20:32:49.3616220Z             x1 = x1.contiguous()
2025-05-07T20:32:49.3616476Z
2025-05-07T20:32:49.3616685Z         if scale_ub is not None:
2025-05-07T20:32:49.3616984Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:49.3617338Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:49.3617674Z             )
2025-05-07T20:32:49.3617885Z         else:
2025-05-07T20:32:49.3618108Z             scale_ub_tensor = None
2025-05-07T20:32:49.3618386Z
2025-05-07T20:32:49.3618639Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:49.3618970Z             op = silu_mul_quant
2025-05-07T20:32:49.3619243Z             if compiled:
2025-05-07T20:32:49.3619697Z                 op = torch.compile(op)
2025-05-07T20:32:49.3620018Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:49.3620316Z
2025-05-07T20:32:49.3620532Z         y_fp8, y_scale = fn()
2025-05-07T20:32:49.3620846Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:49.3621150Z
2025-05-07T20:32:49.3621411Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:49.3621772Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:49.3622081Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:49.3622415Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:49.3622796Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:49.3623120Z
2025-05-07T20:32:49.3623339Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:49.3623545Z
2025-05-07T20:32:49.3623659Z moe/activation_test.py:126:
2025-05-07T20:32:49.3624249Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:49.3624652Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:49.3625000Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:49.3625831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:49.3626757Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:49.3627334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:49.3628054Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:49.3628780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:49.3629536Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:49.3630336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:49.3631127Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:49.3631893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:49.3632574Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:49.3638905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:49.3639474Z     fn()
2025-05-07T20:32:49.3640010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:49.3640627Z     self.fn.run(
2025-05-07T20:32:49.3641125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:49.3641727Z     kernel = self.compile(
2025-05-07T20:32:49.3642561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:49.3643347Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:49.3643784Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:49.3644028Z
2025-05-07T20:32:49.3644249Z self = <...>
2025-05-07T20:32:49.3645390Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:49.3646841Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7eff0949ec20>}
2025-05-07T20:32:49.3648476Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:49.3649550Z context = <...>
2025-05-07T20:32:49.3649867Z
2025-05-07T20:32:49.3650043Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:49.3650596Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:49.3651099Z                            module_map=module_map)
2025-05-07T20:32:49.3651482Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:49.3651861Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:49.3652145Z E       ^
2025-05-07T20:32:49.3652719Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:49.3653347Z
2025-05-07T20:32:49.3653794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
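The triton.runtime.autotuner frames recur in every traceback here because the first launch of an autotuned kernel benchmarks each candidate config, and each benchmark compiles the kernel, so an unsupported dtype is only rejected at this point rather than at import time. A schematic sketch of that loop, reconstructed from the frames above (argument names are illustrative, not Triton's exact signatures):

    def autotune_run_sketch(configs, kernel_call, do_bench):
        # Per autotuner.py:186/166 above: time every candidate config; timing
        # launches the kernel, and the first launch compiles it, which is where
        # the fp8e4nv ValueError is raised.
        timings = {
            config: do_bench(lambda c=config: kernel_call(c), quantiles=(0.5, 0.2, 0.8))
            for config in configs
        }
        # do_bench returns (median, 20th, 80th percentile); pick the best median.
        return min(timings, key=lambda c: timings[c][0])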
2025-05-07T20:32:49.3654345Z
2025-05-07T20:32:49.3654478Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:49.3654946Z     self=<...>,
2025-05-07T20:32:49.3655467Z     T=1,
2025-05-07T20:32:49.3655667Z     D=5120,
2025-05-07T20:32:49.3655884Z     scale_ub=1200.0,
2025-05-07T20:32:49.3656121Z     contiguous=True,
2025-05-07T20:32:49.3656364Z     compiled=True,
2025-05-07T20:32:49.3656586Z )
2025-05-07T20:32:49.5100793Z self = <...>
2025-05-07T20:32:49.5101589Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:49.5101975Z
2025-05-07T20:32:49.5102096Z     @given(
2025-05-07T20:32:49.5102393Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:49.5102732Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:49.5103067Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:49.5103425Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:49.5103775Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:49.5104082Z     )
2025-05-07T20:32:49.5104471Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:49.5104948Z     def test_silu_mul_quant(
2025-05-07T20:32:49.5105207Z         self,
2025-05-07T20:32:49.5105425Z         T: int,
2025-05-07T20:32:49.5105637Z         D: int,
2025-05-07T20:32:49.5105872Z         scale_ub: Optional[float],
2025-05-07T20:32:49.5106166Z         contiguous: bool,
2025-05-07T20:32:49.5106426Z         compiled: bool,
2025-05-07T20:32:49.5106674Z     ) -> None:
2025-05-07T20:32:49.5106909Z         torch.manual_seed(2025)
2025-05-07T20:32:49.5107167Z
2025-05-07T20:32:49.5107454Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:49.5107818Z
2025-05-07T20:32:49.5108036Z         x_sign = torch.sign(x)
2025-05-07T20:32:49.5108340Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:49.5108672Z         x = x_sign * x_clamp
2025-05-07T20:32:49.5108931Z         x0 = x[:, :D]
2025-05-07T20:32:49.5109165Z         x1 = x[:, D:]
2025-05-07T20:32:49.5109395Z
2025-05-07T20:32:49.5109598Z         if contiguous:
2025-05-07T20:32:49.5109844Z             x0 = x0.contiguous()
2025-05-07T20:32:49.5110123Z             x1 = x1.contiguous()
2025-05-07T20:32:49.5110381Z
2025-05-07T20:32:49.5110586Z         if scale_ub is not None:
2025-05-07T20:32:49.5110880Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:49.5111239Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:49.5111561Z             )
2025-05-07T20:32:49.5111772Z         else:
2025-05-07T20:32:49.5112001Z             scale_ub_tensor = None
2025-05-07T20:32:49.5112271Z
2025-05-07T20:32:49.5112515Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:49.5113026Z             op = silu_mul_quant
2025-05-07T20:32:49.5113300Z             if compiled:
2025-05-07T20:32:49.5113656Z                 op = torch.compile(op)
2025-05-07T20:32:49.5113977Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:49.5114274Z
2025-05-07T20:32:49.5114485Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:49.5114663Z
2025-05-07T20:32:49.5114770Z moe/activation_test.py:117:
2025-05-07T20:32:49.5115087Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:49.5115435Z moe/activation_test.py:115: in fn
2025-05-07T20:32:49.5115737Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:49.5116330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:49.5116920Z     return fn(*args, **kwargs)
2025-05-07T20:32:49.5117623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:49.5118356Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:49.5118929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:49.5119774Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:49.5120474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:49.5121045Z     kernel = self.compile(
2025-05-07T20:32:49.5121619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:49.5122309Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:49.5122734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:49.5122973Z
2025-05-07T20:32:49.5123199Z self = <...>
2025-05-07T20:32:49.5124587Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:49.5126039Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7eff08d2ac20>}
2025-05-07T20:32:49.5127448Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:49.5128521Z context = <...>
2025-05-07T20:32:49.5128826Z
2025-05-07T20:32:49.5129010Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:49.5129561Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:49.5130063Z                            module_map=module_map)
2025-05-07T20:32:49.5130456Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:49.5130834Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:49.5131128Z E       ^
2025-05-07T20:32:49.5131624Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:49.5132099Z
2025-05-07T20:32:49.5132539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
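Every failure in this run bottoms out in the same ValueError: Triton's fp8e4nv is the e4m3 float8 format, which only has hardware support from compute capability (8, 9) upward, while older architectures expose only the fp8e5 and fp8e4b15 types listed in the message. A minimal guard sketch, assuming only the public torch.cuda API (the helper and decorator names are illustrative, not part of FBGEMM's test suite):

    import unittest
    import torch

    def supports_fp8_e4m3() -> bool:
        # Triton rejects fp8e4nv (e4m3) below SM 8.9, per the ValueError above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Usage: skip fp8 tests on GPUs that only expose fp8e5 / fp8e4b15.
    skip_unless_fp8 = unittest.skipUnless(
        supports_fp8_e4m3(), "fp8e4nv (e4m3) unsupported on this GPU"
    )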
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.5132099Z 2025-05-07T20:32:49.5132539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.5133078Z 2025-05-07T20:32:49.5133198Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.5133633Z self=, 2025-05-07T20:32:49.5134062Z T=1, 2025-05-07T20:32:49.5134268Z D=5120, 2025-05-07T20:32:49.5134482Z scale_ub=None, 2025-05-07T20:32:49.5134718Z contiguous=False, 2025-05-07T20:32:49.5135100Z compiled=True, 2025-05-07T20:32:49.5135319Z ) 2025-05-07T20:32:49.5827630Z self = 2025-05-07T20:32:49.5828389Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:49.5828757Z 2025-05-07T20:32:49.5828857Z @given( 2025-05-07T20:32:49.5829115Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.5829468Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.5829808Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.5830171Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.5830537Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.5830858Z ) 2025-05-07T20:32:49.5831243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.5831729Z def test_silu_mul_quant( 2025-05-07T20:32:49.5831998Z self, 2025-05-07T20:32:49.5832221Z T: int, 2025-05-07T20:32:49.5832446Z D: int, 2025-05-07T20:32:49.5832695Z scale_ub: Optional[float], 2025-05-07T20:32:49.5832995Z contiguous: bool, 2025-05-07T20:32:49.5833256Z compiled: bool, 2025-05-07T20:32:49.5833796Z ) -> None: 2025-05-07T20:32:49.5834037Z torch.manual_seed(2025) 2025-05-07T20:32:49.5834306Z 2025-05-07T20:32:49.5834610Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.5835110Z 2025-05-07T20:32:49.5835333Z x_sign = torch.sign(x) 2025-05-07T20:32:49.5835654Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.5835994Z x = x_sign * x_clamp 2025-05-07T20:32:49.5836255Z x0 = x[:, :D] 2025-05-07T20:32:49.5836495Z x1 = x[:, D:] 2025-05-07T20:32:49.5836727Z 2025-05-07T20:32:49.5836929Z if contiguous: 2025-05-07T20:32:49.5837191Z x0 = x0.contiguous() 2025-05-07T20:32:49.5837476Z x1 = x1.contiguous() 2025-05-07T20:32:49.5837745Z 2025-05-07T20:32:49.5837961Z if scale_ub is not None: 2025-05-07T20:32:49.5838268Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.5838634Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.5838975Z ) 2025-05-07T20:32:49.5839193Z else: 2025-05-07T20:32:49.5839427Z scale_ub_tensor = None 2025-05-07T20:32:49.5839701Z 2025-05-07T20:32:49.5839962Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.5840307Z op = silu_mul_quant 2025-05-07T20:32:49.5840578Z if compiled: 2025-05-07T20:32:49.5840853Z op = torch.compile(op) 2025-05-07T20:32:49.5841183Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.5841490Z 2025-05-07T20:32:49.5841709Z y_fp8, y_scale = fn() 2025-05-07T20:32:49.5842023Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:49.5842335Z 2025-05-07T20:32:49.5842603Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.5842971Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:49.5843291Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:49.5843633Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:49.5844028Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:49.5844366Z 2025-05-07T20:32:49.5844585Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:49.5844805Z 2025-05-07T20:32:49.5844915Z moe/activation_test.py:126: 2025-05-07T20:32:49.5845247Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.5845610Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:49.5845971Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:49.5846826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:49.5847799Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:49.5848392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.5849133Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.5849882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:49.5850663Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:49.5851475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:49.5852287Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:49.5853080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:49.5853780Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:49.5854423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:49.5855129Z fn() 2025-05-07T20:32:49.5855679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:49.5856309Z self.fn.run( 2025-05-07T20:32:49.5856816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.5857400Z kernel = self.compile( 2025-05-07T20:32:49.5857992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.5858697Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.5859143Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.5859392Z 2025-05-07T20:32:49.5859634Z self = 2025-05-07T20:32:49.5860805Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.5862292Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7eff0837f370>} 2025-05-07T20:32:49.5863746Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.5864857Z context = 2025-05-07T20:32:49.5865170Z 2025-05-07T20:32:49.5865363Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.5865930Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.5866446Z module_map=module_map) 2025-05-07T20:32:49.5866855Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.5867250Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:49.5867540Z E ^ 2025-05-07T20:32:49.5868048Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.5868536Z 2025-05-07T20:32:49.5868991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.5869544Z 2025-05-07T20:32:49.5869666Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.5870113Z self=, 2025-05-07T20:32:49.5870554Z T=1, 2025-05-07T20:32:49.5870764Z D=5120, 2025-05-07T20:32:49.5871062Z scale_ub=None, 2025-05-07T20:32:49.5871305Z contiguous=True, 2025-05-07T20:32:49.5871555Z compiled=False, 2025-05-07T20:32:49.5871780Z ) 2025-05-07T20:32:49.9339467Z self = 2025-05-07T20:32:49.9340289Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:49.9340687Z 2025-05-07T20:32:49.9340811Z @given( 2025-05-07T20:32:49.9341142Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.9341475Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.9341812Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.9342172Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.9342531Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.9342845Z ) 2025-05-07T20:32:49.9343228Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.9343722Z def test_silu_mul_quant( 2025-05-07T20:32:49.9343985Z self, 2025-05-07T20:32:49.9344208Z T: int, 2025-05-07T20:32:49.9344442Z D: int, 2025-05-07T20:32:49.9344718Z scale_ub: Optional[float], 2025-05-07T20:32:49.9345206Z contiguous: bool, 2025-05-07T20:32:49.9345472Z compiled: bool, 2025-05-07T20:32:49.9345712Z ) -> None: 2025-05-07T20:32:49.9345948Z torch.manual_seed(2025) 2025-05-07T20:32:49.9346211Z 2025-05-07T20:32:49.9346503Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.9346869Z 2025-05-07T20:32:49.9347082Z x_sign = torch.sign(x) 2025-05-07T20:32:49.9347388Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.9347721Z x = x_sign * x_clamp 2025-05-07T20:32:49.9347988Z x0 = x[:, :D] 2025-05-07T20:32:49.9348218Z x1 = x[:, D:] 2025-05-07T20:32:49.9348445Z 2025-05-07T20:32:49.9348653Z if contiguous: 2025-05-07T20:32:49.9348910Z x0 = x0.contiguous() 2025-05-07T20:32:49.9349192Z x1 = x1.contiguous() 2025-05-07T20:32:49.9349457Z 2025-05-07T20:32:49.9349675Z if scale_ub is not None: 2025-05-07T20:32:49.9349970Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.9350339Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.9350675Z ) 2025-05-07T20:32:49.9350880Z else: 2025-05-07T20:32:49.9351108Z scale_ub_tensor = None 2025-05-07T20:32:49.9351378Z 2025-05-07T20:32:49.9351621Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.9351961Z op = silu_mul_quant 2025-05-07T20:32:49.9352232Z if compiled: 2025-05-07T20:32:49.9352494Z 
op = torch.compile(op) 2025-05-07T20:32:49.9352817Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.9353113Z 2025-05-07T20:32:49.9353316Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.9353591Z 2025-05-07T20:32:49.9353704Z moe/activation_test.py:117: 2025-05-07T20:32:49.9354021Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.9354375Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.9354679Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.9355419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.9356153Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.9356720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.9357445Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.9358153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.9358721Z kernel = self.compile( 2025-05-07T20:32:49.9359416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.9360114Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.9360535Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.9360781Z 2025-05-07T20:32:49.9361006Z self = 2025-05-07T20:32:49.9362141Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.9363597Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff0837feb0>} 2025-05-07T20:32:49.9365023Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.9366110Z context = 2025-05-07T20:32:49.9366497Z 2025-05-07T20:32:49.9366674Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.9367227Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.9367724Z module_map=module_map) 2025-05-07T20:32:49.9368110Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.9368479Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.9368756Z E ^ 2025-05-07T20:32:49.9369245Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.9369719Z 2025-05-07T20:32:49.9370164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.9370708Z 2025-05-07T20:32:49.9370820Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.9371260Z self=, 2025-05-07T20:32:49.9371696Z T=128, 2025-05-07T20:32:49.9371894Z D=5120, 2025-05-07T20:32:49.9372105Z scale_ub=None, 2025-05-07T20:32:49.9372339Z contiguous=False, 2025-05-07T20:32:49.9372581Z compiled=True, 2025-05-07T20:32:49.9372803Z ) 2025-05-07T20:32:49.9373145Z self = 2025-05-07T20:32:49.9373660Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:49.9373950Z 2025-05-07T20:32:49.9374034Z @given( 2025-05-07T20:32:49.9374286Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.9374638Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.9374988Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.9375345Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.9375694Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.9375994Z ) 2025-05-07T20:32:49.9376379Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.9376852Z def test_silu_mul_quant( 2025-05-07T20:32:49.9377108Z self, 2025-05-07T20:32:49.9377318Z T: int, 2025-05-07T20:32:49.9377529Z D: int, 2025-05-07T20:32:49.9377760Z scale_ub: Optional[float], 2025-05-07T20:32:49.9378047Z contiguous: bool, 2025-05-07T20:32:49.9378307Z compiled: bool, 2025-05-07T20:32:49.9378544Z ) -> None: 2025-05-07T20:32:49.9378781Z torch.manual_seed(2025) 2025-05-07T20:32:49.9379040Z 2025-05-07T20:32:49.9379330Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.9379686Z 2025-05-07T20:32:49.9379894Z x_sign = torch.sign(x) 2025-05-07T20:32:49.9380292Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.9380616Z x = x_sign * x_clamp 2025-05-07T20:32:49.9380872Z x0 = x[:, :D] 2025-05-07T20:32:49.9381106Z x1 = x[:, D:] 2025-05-07T20:32:49.9381329Z 2025-05-07T20:32:49.9381531Z if contiguous: 2025-05-07T20:32:49.9381780Z x0 = x0.contiguous() 2025-05-07T20:32:49.9382053Z x1 = x1.contiguous() 2025-05-07T20:32:49.9382310Z 2025-05-07T20:32:49.9382518Z if scale_ub is not None: 2025-05-07T20:32:49.9382809Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.9383167Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.9383499Z ) 2025-05-07T20:32:49.9383703Z else: 2025-05-07T20:32:49.9383930Z scale_ub_tensor = None 2025-05-07T20:32:49.9384203Z 2025-05-07T20:32:49.9384450Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.9384790Z op = silu_mul_quant 2025-05-07T20:32:49.9385067Z if compiled: 2025-05-07T20:32:49.9385336Z op = torch.compile(op) 2025-05-07T20:32:49.9385649Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.9386031Z 2025-05-07T20:32:49.9386242Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.9386420Z 2025-05-07T20:32:49.9386526Z moe/activation_test.py:117: 2025-05-07T20:32:49.9386844Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.9387204Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.9387503Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.9388100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.9388698Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.9389398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.9390132Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.9390704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.9391428Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.9392134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.9392702Z kernel = self.compile( 2025-05-07T20:32:49.9393277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.9394027Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.9394444Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.9394698Z 2025-05-07T20:32:49.9394923Z self = 2025-05-07T20:32:49.9396071Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.9397530Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff08fce8c0>} 2025-05-07T20:32:49.9398960Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.9400045Z context = 2025-05-07T20:32:49.9400358Z 2025-05-07T20:32:49.9400536Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.9401096Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.9401702Z module_map=module_map) 2025-05-07T20:32:49.9402095Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.9402475Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.9402758Z E ^ 2025-05-07T20:32:49.9403247Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.9403728Z 2025-05-07T20:32:49.9404168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.9404738Z 2025-05-07T20:32:49.9404876Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.9410839Z self=, 2025-05-07T20:32:49.9411299Z T=128, 2025-05-07T20:32:49.9411511Z D=7168, 2025-05-07T20:32:49.9411713Z scale_ub=1200.0, 2025-05-07T20:32:49.9411958Z contiguous=False, 2025-05-07T20:32:49.9412208Z compiled=False, 2025-05-07T20:32:49.9412433Z ) 2025-05-07T20:32:50.0677535Z self = 2025-05-07T20:32:50.0678380Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:50.0679017Z 2025-05-07T20:32:50.0679139Z @given( 2025-05-07T20:32:50.0679487Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.0679903Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.0680233Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.0680588Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.0680936Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.0681243Z ) 2025-05-07T20:32:50.0681618Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.0682093Z def test_silu_mul_quant( 2025-05-07T20:32:50.0682346Z self, 2025-05-07T20:32:50.0682560Z T: int, 2025-05-07T20:32:50.0682782Z D: int, 2025-05-07T20:32:50.0683015Z scale_ub: Optional[float], 2025-05-07T20:32:50.0683311Z contiguous: bool, 2025-05-07T20:32:50.0683576Z compiled: bool, 2025-05-07T20:32:50.0683826Z ) -> None: 2025-05-07T20:32:50.0684063Z torch.manual_seed(2025) 2025-05-07T20:32:50.0684326Z 2025-05-07T20:32:50.0684639Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.0685030Z 2025-05-07T20:32:50.0685242Z x_sign = torch.sign(x) 2025-05-07T20:32:50.0685548Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.0685876Z x = x_sign * x_clamp 2025-05-07T20:32:50.0686136Z x0 = x[:, :D] 2025-05-07T20:32:50.0686367Z x1 = x[:, D:] 2025-05-07T20:32:50.0686595Z 2025-05-07T20:32:50.0686798Z if contiguous: 2025-05-07T20:32:50.0687042Z x0 = x0.contiguous() 2025-05-07T20:32:50.0687320Z x1 = x1.contiguous() 2025-05-07T20:32:50.0687581Z 2025-05-07T20:32:50.0687795Z if scale_ub is not None: 2025-05-07T20:32:50.0688085Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.0688447Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.0688787Z ) 2025-05-07T20:32:50.0688989Z else: 2025-05-07T20:32:50.0689219Z scale_ub_tensor = None 2025-05-07T20:32:50.0689483Z 2025-05-07T20:32:50.0689732Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.0690065Z op = silu_mul_quant 2025-05-07T20:32:50.0690328Z if compiled: 2025-05-07T20:32:50.0690597Z op = torch.compile(op) 2025-05-07T20:32:50.0690914Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.0691206Z 2025-05-07T20:32:50.0691421Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.0691595Z 2025-05-07T20:32:50.0691711Z moe/activation_test.py:117: 2025-05-07T20:32:50.0692155Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.0692511Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.0692818Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.0693552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.0694283Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.0694855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.0695575Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.0696278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.0696838Z kernel = self.compile( 2025-05-07T20:32:50.0697418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.0698120Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.0698533Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.0698780Z 2025-05-07T20:32:50.0699001Z self = 2025-05-07T20:32:50.0700222Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.0701680Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff08d2aef0>} 2025-05-07T20:32:50.0703094Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.0704161Z context = 2025-05-07T20:32:50.0704466Z 2025-05-07T20:32:50.0704641Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.0705215Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.0705702Z module_map=module_map) 2025-05-07T20:32:50.0706082Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.0706453Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.0706725Z E ^ 2025-05-07T20:32:50.0707208Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.0707674Z 2025-05-07T20:32:50.0708105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.0708641Z 2025-05-07T20:32:50.0708756Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.0709190Z self=, 2025-05-07T20:32:50.0709610Z T=128, 2025-05-07T20:32:50.0709805Z D=5120, 2025-05-07T20:32:50.0710017Z scale_ub=None, 2025-05-07T20:32:50.0710445Z contiguous=False, 2025-05-07T20:32:50.0710683Z compiled=False, 2025-05-07T20:32:50.0710902Z ) 2025-05-07T20:32:50.0711235Z self = 2025-05-07T20:32:50.0711744Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:50.0712030Z 2025-05-07T20:32:50.0712112Z @given( 2025-05-07T20:32:50.0712355Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.0712681Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.0713002Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.0713351Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.0713836Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.0714136Z ) 2025-05-07T20:32:50.0714505Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.0715022Z def test_silu_mul_quant( 2025-05-07T20:32:50.0715278Z self, 2025-05-07T20:32:50.0715488Z T: int, 2025-05-07T20:32:50.0715698Z D: int, 2025-05-07T20:32:50.0715926Z scale_ub: Optional[float], 2025-05-07T20:32:50.0716218Z contiguous: bool, 2025-05-07T20:32:50.0716477Z compiled: bool, 2025-05-07T20:32:50.0716709Z ) -> None: 2025-05-07T20:32:50.0716938Z torch.manual_seed(2025) 2025-05-07T20:32:50.0717193Z 2025-05-07T20:32:50.0717474Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.0720667Z 2025-05-07T20:32:50.0720877Z x_sign = torch.sign(x) 2025-05-07T20:32:50.0721213Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.0721693Z x = x_sign * x_clamp 2025-05-07T20:32:50.0721966Z x0 = x[:, :D] 2025-05-07T20:32:50.0722195Z x1 = x[:, D:] 2025-05-07T20:32:50.0722418Z 2025-05-07T20:32:50.0722615Z if contiguous: 2025-05-07T20:32:50.0722855Z x0 = x0.contiguous() 2025-05-07T20:32:50.0723202Z x1 = x1.contiguous() 2025-05-07T20:32:50.0723456Z 2025-05-07T20:32:50.0723654Z if scale_ub is not None: 2025-05-07T20:32:50.0724229Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.0724585Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.0724903Z ) 2025-05-07T20:32:50.0725111Z else: 2025-05-07T20:32:50.0725334Z scale_ub_tensor = None 2025-05-07T20:32:50.0725601Z 2025-05-07T20:32:50.0725843Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.0726203Z op = silu_mul_quant 2025-05-07T20:32:50.0726463Z if compiled: 2025-05-07T20:32:50.0726728Z op = torch.compile(op) 2025-05-07T20:32:50.0727048Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.0727335Z 2025-05-07T20:32:50.0727541Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.0727715Z 2025-05-07T20:32:50.0727826Z moe/activation_test.py:117: 2025-05-07T20:32:50.0728139Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.0728486Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.0728785Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.0729506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.0730219Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.0730781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.0731501Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.0732192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.0732750Z kernel = self.compile( 2025-05-07T20:32:50.0733317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.0734007Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.0734416Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.0734655Z 2025-05-07T20:32:50.0734871Z self = 2025-05-07T20:32:50.0735995Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.0737559Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff08fccb80>} 2025-05-07T20:32:50.0738955Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.0740030Z context = 2025-05-07T20:32:50.0740339Z 2025-05-07T20:32:50.0740515Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.0741062Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.0741551Z module_map=module_map) 2025-05-07T20:32:50.0741934Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.0742415Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.0742689Z E ^ 2025-05-07T20:32:50.0743177Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.0743654Z 2025-05-07T20:32:50.0744087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.0744709Z 2025-05-07T20:32:50.0744826Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.0745254Z self=, 2025-05-07T20:32:50.0745672Z T=128, 2025-05-07T20:32:50.0745870Z D=5120, 2025-05-07T20:32:50.0746077Z scale_ub=1200.0, 2025-05-07T20:32:50.0746308Z contiguous=True, 2025-05-07T20:32:50.0746548Z compiled=False, 2025-05-07T20:32:50.0746765Z ) 2025-05-07T20:32:50.2691157Z self = 2025-05-07T20:32:50.2692738Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:50.2693465Z 2025-05-07T20:32:50.2693628Z @given( 2025-05-07T20:32:50.2694122Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.2694698Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.2695019Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.2695372Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.2695719Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.2696017Z ) 2025-05-07T20:32:50.2696389Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.2696854Z def test_silu_mul_quant( 2025-05-07T20:32:50.2697112Z self, 2025-05-07T20:32:50.2697313Z T: int, 2025-05-07T20:32:50.2697521Z D: int, 2025-05-07T20:32:50.2697752Z scale_ub: Optional[float], 2025-05-07T20:32:50.2698034Z contiguous: bool, 2025-05-07T20:32:50.2698291Z compiled: bool, 2025-05-07T20:32:50.2698535Z ) -> None: 2025-05-07T20:32:50.2698759Z torch.manual_seed(2025) 2025-05-07T20:32:50.2699016Z 2025-05-07T20:32:50.2699316Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.2699678Z 2025-05-07T20:32:50.2699881Z x_sign = torch.sign(x) 2025-05-07T20:32:50.2700194Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.2700523Z x = x_sign * x_clamp 2025-05-07T20:32:50.2700774Z x0 = x[:, :D] 2025-05-07T20:32:50.2701009Z x1 = x[:, D:] 2025-05-07T20:32:50.2701231Z 2025-05-07T20:32:50.2701427Z if contiguous: 2025-05-07T20:32:50.2701680Z x0 = x0.contiguous() 2025-05-07T20:32:50.2701958Z x1 = x1.contiguous() 2025-05-07T20:32:50.2702211Z 2025-05-07T20:32:50.2702424Z if scale_ub is not None: 2025-05-07T20:32:50.2702716Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.2703068Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.2703396Z ) 2025-05-07T20:32:50.2703606Z else: 2025-05-07T20:32:50.2704025Z scale_ub_tensor = None 2025-05-07T20:32:50.2704300Z 2025-05-07T20:32:50.2704554Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.2704929Z op = silu_mul_quant 2025-05-07T20:32:50.2705198Z if compiled: 2025-05-07T20:32:50.2705465Z op = torch.compile(op) 2025-05-07T20:32:50.2705785Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.2706071Z 2025-05-07T20:32:50.2706281Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.2706454Z 2025-05-07T20:32:50.2706569Z moe/activation_test.py:117: 2025-05-07T20:32:50.2706877Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.2707228Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.2707527Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.2708336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.2709069Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.2709634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.2710355Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.2711109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.2711668Z kernel = self.compile( 2025-05-07T20:32:50.2712238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.2712930Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.2713340Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.2713681Z 2025-05-07T20:32:50.2713900Z self = 2025-05-07T20:32:50.2715029Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.2716465Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff08fcff40>} 2025-05-07T20:32:50.2717859Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.2718930Z context = 2025-05-07T20:32:50.2719240Z 2025-05-07T20:32:50.2719420Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.2719972Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.2720461Z module_map=module_map) 2025-05-07T20:32:50.2720848Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.2721222Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.2721497Z E ^ 2025-05-07T20:32:50.2721987Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.2722461Z 2025-05-07T20:32:50.2722891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.2723422Z 2025-05-07T20:32:50.2723539Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.2724233Z self=, 2025-05-07T20:32:50.2724662Z T=1, 2025-05-07T20:32:50.2724862Z D=7168, 2025-05-07T20:32:50.2725067Z scale_ub=1200.0, 2025-05-07T20:32:50.2725309Z contiguous=True, 2025-05-07T20:32:50.2725680Z compiled=True, 2025-05-07T20:32:50.2725898Z ) 2025-05-07T20:32:50.2726237Z self = 2025-05-07T20:32:50.2726747Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:50.2727024Z 2025-05-07T20:32:50.2727116Z @given( 2025-05-07T20:32:50.2727357Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.2727691Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.2728018Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.2728362Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.2728712Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.2729015Z ) 2025-05-07T20:32:50.2729381Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.2729925Z def test_silu_mul_quant( 2025-05-07T20:32:50.2730182Z self, 2025-05-07T20:32:50.2730393Z T: int, 2025-05-07T20:32:50.2730610Z D: int, 2025-05-07T20:32:50.2730844Z scale_ub: Optional[float], 2025-05-07T20:32:50.2731143Z contiguous: bool, 2025-05-07T20:32:50.2731399Z compiled: bool, 2025-05-07T20:32:50.2731703Z ) -> None: 2025-05-07T20:32:50.2731934Z torch.manual_seed(2025) 2025-05-07T20:32:50.2732184Z 2025-05-07T20:32:50.2732473Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.2732833Z 2025-05-07T20:32:50.2733035Z x_sign = torch.sign(x) 2025-05-07T20:32:50.2733344Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.2733671Z x = x_sign * x_clamp 2025-05-07T20:32:50.2733923Z x0 = x[:, :D] 2025-05-07T20:32:50.2734156Z x1 = x[:, D:] 2025-05-07T20:32:50.2734383Z 2025-05-07T20:32:50.2734587Z if contiguous: 2025-05-07T20:32:50.2734836Z x0 = x0.contiguous() 2025-05-07T20:32:50.2735112Z x1 = x1.contiguous() 2025-05-07T20:32:50.2735368Z 2025-05-07T20:32:50.2735579Z if scale_ub is not None: 2025-05-07T20:32:50.2735873Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.2736232Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.2736558Z ) 2025-05-07T20:32:50.2736770Z else: 2025-05-07T20:32:50.2736999Z scale_ub_tensor = None 2025-05-07T20:32:50.2737263Z 2025-05-07T20:32:50.2737517Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.2737853Z op = silu_mul_quant 2025-05-07T20:32:50.2738115Z if compiled: 2025-05-07T20:32:50.2738387Z op = torch.compile(op) 2025-05-07T20:32:50.2738708Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.2738995Z 2025-05-07T20:32:50.2739210Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.2739384Z 2025-05-07T20:32:50.2739498Z moe/activation_test.py:117: 2025-05-07T20:32:50.2739818Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.2740165Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.2740473Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.2741061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.2741644Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.2742338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.2743059Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.2743624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.2744340Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.2745036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.2745681Z kernel = self.compile( 2025-05-07T20:32:50.2746248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.2746939Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.2747359Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.2747597Z 2025-05-07T20:32:50.2747819Z self = 2025-05-07T20:32:50.2748939Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.2750370Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff08fcf640>} 2025-05-07T20:32:50.2751825Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.2752894Z context = 2025-05-07T20:32:50.2753237Z 2025-05-07T20:32:50.2753418Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.2754012Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.2754504Z module_map=module_map) 2025-05-07T20:32:50.2754888Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.2755255Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.2755535Z E ^ 2025-05-07T20:32:50.2756024Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.2756494Z 2025-05-07T20:32:50.2756937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.2757468Z 2025-05-07T20:32:50.2757580Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.2758019Z self=, 2025-05-07T20:32:50.2758448Z T=1, 2025-05-07T20:32:50.2758642Z D=7168, 2025-05-07T20:32:50.2758852Z scale_ub=1200.0, 2025-05-07T20:32:50.2759095Z contiguous=False, 2025-05-07T20:32:50.2759333Z compiled=True, 2025-05-07T20:32:50.2759555Z ) 2025-05-07T20:32:50.4225454Z self = 2025-05-07T20:32:50.4226366Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:50.4226869Z 2025-05-07T20:32:50.4227022Z @given( 2025-05-07T20:32:50.4227427Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.4227983Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.4228552Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.4229139Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.4229734Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.4230260Z ) 2025-05-07T20:32:50.4230908Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.4231635Z def test_silu_mul_quant( 2025-05-07T20:32:50.4232038Z self, 2025-05-07T20:32:50.4232359Z T: int, 2025-05-07T20:32:50.4232690Z D: int, 2025-05-07T20:32:50.4233068Z scale_ub: Optional[float], 2025-05-07T20:32:50.4233652Z contiguous: bool, 2025-05-07T20:32:50.4246651Z compiled: bool, 2025-05-07T20:32:50.4247096Z ) -> None: 2025-05-07T20:32:50.4247503Z torch.manual_seed(2025) 2025-05-07T20:32:50.4247965Z 2025-05-07T20:32:50.4248488Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.4249104Z 2025-05-07T20:32:50.4250846Z x_sign = torch.sign(x) 2025-05-07T20:32:50.4251394Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.4251970Z x = x_sign * x_clamp 2025-05-07T20:32:50.4252431Z x0 = x[:, :D] 2025-05-07T20:32:50.4252834Z x1 = x[:, D:] 2025-05-07T20:32:50.4253216Z 2025-05-07T20:32:50.4253565Z if contiguous: 2025-05-07T20:32:50.4254031Z x0 = x0.contiguous() 2025-05-07T20:32:50.4254510Z x1 = x1.contiguous() 2025-05-07T20:32:50.4254949Z 2025-05-07T20:32:50.4255302Z if scale_ub is not None: 2025-05-07T20:32:50.4255824Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.4256447Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.4257019Z ) 2025-05-07T20:32:50.4257538Z else: 2025-05-07T20:32:50.4257919Z scale_ub_tensor = None 2025-05-07T20:32:50.4258398Z 2025-05-07T20:32:50.4258829Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.4259396Z op = silu_mul_quant 2025-05-07T20:32:50.4259843Z if compiled: 2025-05-07T20:32:50.4260283Z op = torch.compile(op) 2025-05-07T20:32:50.4260933Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.4261398Z 2025-05-07T20:32:50.4261741Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.4262001Z 2025-05-07T20:32:50.4262186Z moe/activation_test.py:117: 2025-05-07T20:32:50.4262722Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.4263303Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.4263825Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.4264826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.4265859Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.4267077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.4268371Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.4269342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.4270638Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.4271864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.4272857Z kernel = self.compile( 2025-05-07T20:32:50.4273905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.4275048Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.4275825Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.4276271Z 2025-05-07T20:32:50.4276651Z self = 2025-05-07T20:32:50.4278667Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.4281345Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf7f5b5b0>} 2025-05-07T20:32:50.4283894Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.4285852Z context = 2025-05-07T20:32:50.4286412Z 2025-05-07T20:32:50.4286714Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.4287820Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.4288693Z module_map=module_map) 2025-05-07T20:32:50.4289344Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.4289993Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.4290471Z E ^ 2025-05-07T20:32:50.4291327Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.4292174Z 2025-05-07T20:32:50.4292944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.4293921Z 2025-05-07T20:32:50.4294109Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.4294865Z self=, 2025-05-07T20:32:50.4295702Z T=1, 2025-05-07T20:32:50.4296041Z D=7168, 2025-05-07T20:32:50.4296389Z scale_ub=None, 2025-05-07T20:32:50.4296773Z contiguous=False, 2025-05-07T20:32:50.4297183Z compiled=True, 2025-05-07T20:32:50.4297552Z ) 2025-05-07T20:32:50.5331286Z self = 2025-05-07T20:32:50.5332344Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:50.5332655Z 2025-05-07T20:32:50.5332756Z @given( 2025-05-07T20:32:50.5333028Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.5333395Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.5333753Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.5334132Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.5334517Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.5334850Z ) 2025-05-07T20:32:50.5335265Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.5335770Z def test_silu_mul_quant( 2025-05-07T20:32:50.5336056Z self, 2025-05-07T20:32:50.5336297Z T: int, 2025-05-07T20:32:50.5336526Z D: int, 2025-05-07T20:32:50.5336787Z scale_ub: Optional[float], 2025-05-07T20:32:50.5337104Z contiguous: bool, 2025-05-07T20:32:50.5337386Z compiled: bool, 2025-05-07T20:32:50.5337657Z ) -> None: 2025-05-07T20:32:50.5337912Z torch.manual_seed(2025) 2025-05-07T20:32:50.5338191Z 2025-05-07T20:32:50.5338508Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.5338903Z 2025-05-07T20:32:50.5339124Z x_sign = torch.sign(x) 2025-05-07T20:32:50.5339458Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.5339816Z x = x_sign * x_clamp 2025-05-07T20:32:50.5340090Z x0 = x[:, :D] 2025-05-07T20:32:50.5340347Z x1 = x[:, D:] 2025-05-07T20:32:50.5340590Z 2025-05-07T20:32:50.5340808Z if contiguous: 2025-05-07T20:32:50.5341073Z x0 = x0.contiguous() 2025-05-07T20:32:50.5341381Z x1 = x1.contiguous() 2025-05-07T20:32:50.5341661Z 2025-05-07T20:32:50.5341881Z if scale_ub is not None: 2025-05-07T20:32:50.5342202Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.5342594Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.5342947Z ) 2025-05-07T20:32:50.5343177Z else: 2025-05-07T20:32:50.5343424Z scale_ub_tensor = None 2025-05-07T20:32:50.5343708Z 2025-05-07T20:32:50.5343979Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.5344342Z op = silu_mul_quant 2025-05-07T20:32:50.5344627Z if compiled: 2025-05-07T20:32:50.5344917Z op = torch.compile(op) 2025-05-07T20:32:50.5345261Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.5345576Z 2025-05-07T20:32:50.5345803Z y_fp8, y_scale = fn() 2025-05-07T20:32:50.5346135Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:50.5346474Z 2025-05-07T20:32:50.5346929Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.5347319Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:50.5347666Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:50.5348027Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:50.5348443Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:50.5348803Z 2025-05-07T20:32:50.5349036Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:50.5349266Z 2025-05-07T20:32:50.5349383Z moe/activation_test.py:126: 2025-05-07T20:32:50.5349732Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.5350120Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:50.5350596Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:50.5351501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:50.5352355Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:50.5352975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.5353875Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.5354660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:50.5355484Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:50.5356333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:50.5357181Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:50.5358014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:50.5358745Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:50.5359423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:50.5360016Z fn() 2025-05-07T20:32:50.5360595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:50.5361250Z self.fn.run( 2025-05-07T20:32:50.5361784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.5362388Z kernel = self.compile( 2025-05-07T20:32:50.5363005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.5363743Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.5364197Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.5364465Z 2025-05-07T20:32:50.5364711Z self = 2025-05-07T20:32:50.5366017Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.5367623Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efdf7f59bd0>}
2025-05-07T20:32:50.5369148Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:50.5370317Z context = 
2025-05-07T20:32:50.5370647Z 
2025-05-07T20:32:50.5370939Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:50.5371541Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:50.5372076Z module_map=module_map)
2025-05-07T20:32:50.5372503Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:50.5372922Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:50.5373227Z E ^
2025-05-07T20:32:50.5373762Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:50.5374280Z 
2025-05-07T20:32:50.5374753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:50.5375333Z 
2025-05-07T20:32:50.5375515Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:50.9017576Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:50.9018573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:50.9019294Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:50.9054194Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:50.9055245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:50.9055959Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:51.0191912Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:51.0193006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:51.0193804Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:51.0230211Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:51.0231204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:51.1597206Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:51.1631815Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:51.1632858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:51.1633643Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:51.1667196Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:51.1668156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:51.1668841Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:51.3827698Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:51.3828690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
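Every Hypothesis example in this run fails at the same point: Triton refuses to lower the fp8e4nv (float8_e4m3fn) element type while compiling _fbgemm_silu_mul_quant (and _kernel_quantize_fp8_row in the reference path), reporting that only fp8e4b15 and fp8e5 are available on this GPU. Triton's fp8e4nv lowering appears to require an NVIDIA GPU with compute capability 8.9 or newer (Ada/Hopper); older targets raise exactly the ValueError seen above. On such architectures a test like this can only pass if it is skipped or routed to a non-fp8 path. A minimal capability guard is sketched below, assuming torch and unittest; supports_fp8e4nv and ActivationTests are illustrative names, not identifiers taken from this log:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) Triton kernels only compile on NVIDIA GPUs with
    # compute capability >= 8.9; older targets raise the ValueError seen above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical placement on the failing test class:
@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
class ActivationTests(unittest.TestCase):
    ...

Gating at the class level would also keep Hypothesis from re-running every parameter combination only to hit the same compiler error, as the remaining examples below do.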
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3828215Z 2025-05-07T20:32:51.3828690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3829277Z 2025-05-07T20:32:51.3829406Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3829886Z self=, 2025-05-07T20:32:51.3830350Z T=4096, 2025-05-07T20:32:51.3830579Z D=5120, 2025-05-07T20:32:51.3830806Z scale_ub=None, 2025-05-07T20:32:51.3831070Z contiguous=False, 2025-05-07T20:32:51.3831336Z compiled=True, 2025-05-07T20:32:51.3831576Z ) 2025-05-07T20:32:51.3831941Z self = 2025-05-07T20:32:51.3832510Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:51.3832821Z 2025-05-07T20:32:51.3832918Z @given( 2025-05-07T20:32:51.3833189Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3833624Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3833991Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3834371Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3834812Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3835230Z ) 2025-05-07T20:32:51.3835745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3836375Z def test_silu_mul_quant( 2025-05-07T20:32:51.3836736Z self, 2025-05-07T20:32:51.3837019Z T: int, 2025-05-07T20:32:51.3837284Z D: int, 2025-05-07T20:32:51.3837543Z scale_ub: Optional[float], 2025-05-07T20:32:51.3837863Z contiguous: bool, 2025-05-07T20:32:51.3838142Z compiled: bool, 2025-05-07T20:32:51.3838406Z ) -> None: 2025-05-07T20:32:51.3838666Z torch.manual_seed(2025) 2025-05-07T20:32:51.3838944Z 2025-05-07T20:32:51.3839267Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3839669Z 2025-05-07T20:32:51.3839896Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3840237Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3840598Z x = x_sign * x_clamp 2025-05-07T20:32:51.3841019Z x0 = x[:, :D] 2025-05-07T20:32:51.3841283Z x1 = x[:, D:] 2025-05-07T20:32:51.3841535Z 2025-05-07T20:32:51.3841750Z if contiguous: 2025-05-07T20:32:51.3842028Z x0 = x0.contiguous() 2025-05-07T20:32:51.3842337Z x1 = x1.contiguous() 2025-05-07T20:32:51.3842617Z 2025-05-07T20:32:51.3842847Z if scale_ub is not None: 2025-05-07T20:32:51.3843167Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3843555Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3843915Z ) 2025-05-07T20:32:51.3844147Z else: 2025-05-07T20:32:51.3844393Z scale_ub_tensor = None 2025-05-07T20:32:51.3844682Z 2025-05-07T20:32:51.3844955Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3845423Z op = silu_mul_quant 2025-05-07T20:32:51.3845732Z if compiled: 2025-05-07T20:32:51.3846022Z op = torch.compile(op) 2025-05-07T20:32:51.3846374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3846687Z 2025-05-07T20:32:51.3846914Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.3847105Z 2025-05-07T20:32:51.3847227Z moe/activation_test.py:117: 2025-05-07T20:32:51.3847621Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3848010Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.3848341Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3848988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.3849628Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.3850385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.3851176Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3851795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3852575Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3853333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3853950Z kernel = self.compile( 2025-05-07T20:32:51.3854571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3855363Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3855831Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3856094Z 2025-05-07T20:32:51.3856341Z self = 2025-05-07T20:32:51.3857574Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3859144Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf7a09240>} 2025-05-07T20:32:51.3860683Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3861854Z context = 2025-05-07T20:32:51.3862189Z 2025-05-07T20:32:51.3862382Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3862982Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3863526Z module_map=module_map) 2025-05-07T20:32:51.3864042Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3864452Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3864757Z E ^ 2025-05-07T20:32:51.3865345Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3865863Z 2025-05-07T20:32:51.3866337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3866924Z 2025-05-07T20:32:51.7479112Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.7480348Z self=, 2025-05-07T20:32:51.7481479Z T=4096, 2025-05-07T20:32:51.7482009Z D=5120, 2025-05-07T20:32:51.7482487Z scale_ub=1200.0, 2025-05-07T20:32:51.7483165Z contiguous=False, 2025-05-07T20:32:51.7483618Z compiled=False, 2025-05-07T20:32:51.7484027Z ) 2025-05-07T20:32:51.7484670Z self = 2025-05-07T20:32:51.7485542Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.7485884Z 2025-05-07T20:32:51.7485977Z @given( 2025-05-07T20:32:51.7486247Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.7486689Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.7487040Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.7487424Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.7487811Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.7488146Z ) 2025-05-07T20:32:51.7488549Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.7489067Z def test_silu_mul_quant( 2025-05-07T20:32:51.7489351Z self, 2025-05-07T20:32:51.7489579Z T: int, 2025-05-07T20:32:51.7489812Z D: int, 2025-05-07T20:32:51.7490080Z scale_ub: Optional[float], 2025-05-07T20:32:51.7490394Z contiguous: bool, 2025-05-07T20:32:51.7490684Z compiled: bool, 2025-05-07T20:32:51.7490948Z ) -> None: 2025-05-07T20:32:51.7491202Z torch.manual_seed(2025) 2025-05-07T20:32:51.7491487Z 2025-05-07T20:32:51.7491804Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.7492198Z 2025-05-07T20:32:51.7492426Z x_sign = torch.sign(x) 2025-05-07T20:32:51.7492793Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.7493154Z x = x_sign * x_clamp 2025-05-07T20:32:51.7493435Z x0 = x[:, :D] 2025-05-07T20:32:51.7493686Z x1 = x[:, D:] 2025-05-07T20:32:51.7493932Z 2025-05-07T20:32:51.7494157Z if contiguous: 2025-05-07T20:32:51.7494425Z x0 = x0.contiguous() 2025-05-07T20:32:51.7494728Z x1 = x1.contiguous() 2025-05-07T20:32:51.7495018Z 2025-05-07T20:32:51.7495273Z if scale_ub is not None: 2025-05-07T20:32:51.7495627Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.7496020Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.7496379Z ) 2025-05-07T20:32:51.7496609Z else: 2025-05-07T20:32:51.7496858Z scale_ub_tensor = None 2025-05-07T20:32:51.7497155Z 2025-05-07T20:32:51.7497421Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.7497786Z op = silu_mul_quant 2025-05-07T20:32:51.7498188Z if compiled: 2025-05-07T20:32:51.7498474Z op = torch.compile(op) 2025-05-07T20:32:51.7498818Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.7499139Z 2025-05-07T20:32:51.7499360Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.7499556Z 2025-05-07T20:32:51.7499671Z moe/activation_test.py:117: 2025-05-07T20:32:51.7500024Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7500404Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.7500879Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.7501672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:51.7502453Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.7503067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.7503845Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.7504597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.7505205Z kernel = self.compile( 2025-05-07T20:32:51.7505820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.7506620Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.7507083Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7507343Z 2025-05-07T20:32:51.7507579Z self = 2025-05-07T20:32:51.7508804Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.7510410Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf7a0acb0>} 2025-05-07T20:32:51.7511932Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.7513092Z context = 2025-05-07T20:32:51.7513419Z 2025-05-07T20:32:51.7513699Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.7514298Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.7514840Z module_map=module_map) 2025-05-07T20:32:51.7515256Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.7515695Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.7515990Z E ^ 2025-05-07T20:32:51.7516516Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.7517034Z 2025-05-07T20:32:51.7517503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.7518088Z 2025-05-07T20:32:51.7518205Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.7518677Z self=, 2025-05-07T20:32:51.7519128Z T=4096, 2025-05-07T20:32:51.7519346Z D=5120, 2025-05-07T20:32:51.7519567Z scale_ub=1200.0, 2025-05-07T20:32:51.7519823Z contiguous=False, 2025-05-07T20:32:51.7520074Z compiled=True, 2025-05-07T20:32:51.7520310Z ) 2025-05-07T20:32:51.7520669Z self = 2025-05-07T20:32:51.7521230Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:51.7521544Z 2025-05-07T20:32:51.7521631Z @given( 2025-05-07T20:32:51.7521890Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.7522242Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.7522596Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.7522973Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.7523348Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.7523676Z ) 2025-05-07T20:32:51.7524609Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.7525117Z def test_silu_mul_quant( 2025-05-07T20:32:51.7525397Z self, 2025-05-07T20:32:51.7525623Z T: int, 2025-05-07T20:32:51.7525852Z D: int, 2025-05-07T20:32:51.7526104Z scale_ub: Optional[float], 2025-05-07T20:32:51.7526415Z contiguous: bool, 2025-05-07T20:32:51.7526691Z compiled: bool, 2025-05-07T20:32:51.7526943Z ) -> None: 2025-05-07T20:32:51.7527192Z torch.manual_seed(2025) 2025-05-07T20:32:51.7527469Z 2025-05-07T20:32:51.7527776Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.7528165Z 2025-05-07T20:32:51.7528390Z x_sign = torch.sign(x) 2025-05-07T20:32:51.7528716Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.7529143Z x = x_sign * x_clamp 2025-05-07T20:32:51.7529424Z x0 = x[:, :D] 2025-05-07T20:32:51.7529669Z x1 = x[:, D:] 2025-05-07T20:32:51.7529912Z 2025-05-07T20:32:51.7530135Z if contiguous: 2025-05-07T20:32:51.7530404Z x0 = x0.contiguous() 2025-05-07T20:32:51.7530697Z x1 = x1.contiguous() 2025-05-07T20:32:51.7530974Z 2025-05-07T20:32:51.7531272Z if scale_ub is not None: 2025-05-07T20:32:51.7531581Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.7531964Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.7532320Z ) 2025-05-07T20:32:51.7532537Z else: 2025-05-07T20:32:51.7532781Z scale_ub_tensor = None 2025-05-07T20:32:51.7533072Z 2025-05-07T20:32:51.7533333Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.7533693Z op = silu_mul_quant 2025-05-07T20:32:51.7533989Z if compiled: 2025-05-07T20:32:51.7534271Z op = torch.compile(op) 2025-05-07T20:32:51.7534614Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.7534933Z 2025-05-07T20:32:51.7535158Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.7535355Z 2025-05-07T20:32:51.7535469Z moe/activation_test.py:117: 2025-05-07T20:32:51.7535810Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7536192Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.7536511Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.7537145Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.7537780Z return fn(*args, **kwargs) 
[Every subsequent Hypothesis example fails with the identical traceback and CompilationError — ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") raised from triton/compiler/compiler.py:100 while compiling _fbgemm_silu_mul_quant; only the drawn parameters change:]

2025-05-07T20:32:51.8943817Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:51.8984380Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:51.9027052Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:52.1927398Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:52.1962049Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:52.3083533Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
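[The failure is architectural rather than input-dependent: Triton lowers fp8e4nv only on GPUs with compute capability 8.9 or newer (Ada/Hopper parts), while this g5 runner's A10G reports capability 8.6, so every drawn example aborts in make_ir before the kernel launches. One way to keep such suites green on pre-8.9 runners is to gate the tests on device capability; a hedged sketch follows — the helper name, class name, and skip message are illustrative, not the actual test harness.]

# Sketch: skip FP8 tests on GPUs that cannot compile fp8e4nv (illustrative guard).
import unittest

import torch


def _device_supports_fp8e4nv() -> bool:
    # Triton lowers fp8e4nv (torch.float8_e4m3fn) only on compute capability
    # 8.9+ (Ada/Hopper); the A10G on a g5 instance reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(
    _device_supports_fp8e4nv(),
    "FP8 e4m3 (fp8e4nv) requires a GPU with compute capability >= 8.9",
)
class Fp8ActivationTests(unittest.TestCase):
    ...

[Alternatively, the kernel could fall back on pre-8.9 devices to one of the fp8 dtypes the error message says this architecture does support, fp8e5 or fp8e4b15.]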
2025-05-07T20:32:52.3104365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.3105211Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3105888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3106675Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3107431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3108053Z kernel = self.compile( 2025-05-07T20:32:52.3108683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3109447Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3109899Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3110173Z 2025-05-07T20:32:52.3110415Z self = 2025-05-07T20:32:52.3111655Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3113220Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf71a81f0>} 2025-05-07T20:32:52.3114797Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3122876Z context = 2025-05-07T20:32:52.3123375Z 2025-05-07T20:32:52.3123647Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3124706Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3125451Z module_map=module_map) 2025-05-07T20:32:52.3126008Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3126554Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3126863Z E ^ 2025-05-07T20:32:52.3127401Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3127925Z 2025-05-07T20:32:52.3128414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3129007Z 2025-05-07T20:32:52.6911703Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.6912430Z self=, 2025-05-07T20:32:52.6913076Z T=16384, 2025-05-07T20:32:52.6913392Z D=5120, 2025-05-07T20:32:52.6913774Z scale_ub=1200.0, 2025-05-07T20:32:52.6914033Z contiguous=False, 2025-05-07T20:32:52.6914295Z compiled=False, 2025-05-07T20:32:52.6914533Z ) 2025-05-07T20:32:52.6914900Z self = 2025-05-07T20:32:52.6915464Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:52.6915819Z 2025-05-07T20:32:52.6915917Z @given( 2025-05-07T20:32:52.6916187Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.6916548Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.6916998Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.6917370Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.6917752Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.6918086Z ) 2025-05-07T20:32:52.6918481Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.6918984Z def test_silu_mul_quant( 2025-05-07T20:32:52.6919341Z self, 2025-05-07T20:32:52.6919565Z T: int, 2025-05-07T20:32:52.6919794Z D: int, 2025-05-07T20:32:52.6920046Z scale_ub: Optional[float], 2025-05-07T20:32:52.6920353Z contiguous: bool, 2025-05-07T20:32:52.6920629Z compiled: bool, 2025-05-07T20:32:52.6920889Z ) -> None: 2025-05-07T20:32:52.6921134Z torch.manual_seed(2025) 2025-05-07T20:32:52.6921413Z 2025-05-07T20:32:52.6921728Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.6922118Z 2025-05-07T20:32:52.6922343Z x_sign = torch.sign(x) 2025-05-07T20:32:52.6922678Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.6923029Z x = x_sign * x_clamp 2025-05-07T20:32:52.6923306Z x0 = x[:, :D] 2025-05-07T20:32:52.6923557Z x1 = x[:, D:] 2025-05-07T20:32:52.6924084Z 2025-05-07T20:32:52.6924301Z if contiguous: 2025-05-07T20:32:52.6924571Z x0 = x0.contiguous() 2025-05-07T20:32:52.6924877Z x1 = x1.contiguous() 2025-05-07T20:32:52.6925153Z 2025-05-07T20:32:52.6925380Z if scale_ub is not None: 2025-05-07T20:32:52.6925696Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.6926080Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.6926437Z ) 2025-05-07T20:32:52.6926663Z else: 2025-05-07T20:32:52.6926901Z scale_ub_tensor = None 2025-05-07T20:32:52.6927191Z 2025-05-07T20:32:52.6927460Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.6927821Z op = silu_mul_quant 2025-05-07T20:32:52.6928116Z if compiled: 2025-05-07T20:32:52.6928405Z op = torch.compile(op) 2025-05-07T20:32:52.6928740Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.6929054Z 2025-05-07T20:32:52.6929278Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.6929466Z 2025-05-07T20:32:52.6929592Z moe/activation_test.py:117: 2025-05-07T20:32:52.6929928Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.6930310Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.6930636Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.6931415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:52.6932190Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.6932800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.6933579Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.6934459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.6935068Z kernel = self.compile( 2025-05-07T20:32:52.6935677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.6936414Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.6936868Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.6937131Z 2025-05-07T20:32:52.6937368Z self = 2025-05-07T20:32:52.6938582Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.6940183Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf71a8700>} 2025-05-07T20:32:52.6941690Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.6942903Z context = 2025-05-07T20:32:52.6943227Z 2025-05-07T20:32:52.6943423Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.6944010Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.6944537Z module_map=module_map) 2025-05-07T20:32:52.6944961Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.6945368Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.6945664Z E ^ 2025-05-07T20:32:52.6946201Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.6946712Z 2025-05-07T20:32:52.6947188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.6947766Z 2025-05-07T20:32:52.6947896Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.6948363Z self=, 2025-05-07T20:32:52.6948818Z T=16384, 2025-05-07T20:32:52.6949048Z D=5120, 2025-05-07T20:32:52.6949267Z scale_ub=1200.0, 2025-05-07T20:32:52.6949524Z contiguous=True, 2025-05-07T20:32:52.6949779Z compiled=True, 2025-05-07T20:32:52.6950009Z ) 2025-05-07T20:32:52.6950371Z self = 2025-05-07T20:32:52.6950934Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:52.6951244Z 2025-05-07T20:32:52.6951338Z @given( 2025-05-07T20:32:52.6951604Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.6951968Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.6952319Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.6952699Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.6953079Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.6953413Z ) 2025-05-07T20:32:52.6953897Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.6954400Z def test_silu_mul_quant( 2025-05-07T20:32:52.6954679Z self, 2025-05-07T20:32:52.6954904Z T: int, 2025-05-07T20:32:52.6955134Z D: int, 2025-05-07T20:32:52.6955389Z scale_ub: Optional[float], 2025-05-07T20:32:52.6955702Z contiguous: bool, 2025-05-07T20:32:52.6955981Z compiled: bool, 2025-05-07T20:32:52.6956243Z ) -> None: 2025-05-07T20:32:52.6956494Z torch.manual_seed(2025) 2025-05-07T20:32:52.6956860Z 2025-05-07T20:32:52.6957179Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.6957572Z 2025-05-07T20:32:52.6957792Z x_sign = torch.sign(x) 2025-05-07T20:32:52.6958134Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.6958487Z x = x_sign * x_clamp 2025-05-07T20:32:52.6958761Z x0 = x[:, :D] 2025-05-07T20:32:52.6959015Z x1 = x[:, D:] 2025-05-07T20:32:52.6959259Z 2025-05-07T20:32:52.6959476Z if contiguous: 2025-05-07T20:32:52.6959748Z x0 = x0.contiguous() 2025-05-07T20:32:52.6960055Z x1 = x1.contiguous() 2025-05-07T20:32:52.6960336Z 2025-05-07T20:32:52.6960565Z if scale_ub is not None: 2025-05-07T20:32:52.6960883Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.6961313Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.6961673Z ) 2025-05-07T20:32:52.6961898Z else: 2025-05-07T20:32:52.6962150Z scale_ub_tensor = None 2025-05-07T20:32:52.6962436Z 2025-05-07T20:32:52.6962702Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.6963064Z op = silu_mul_quant 2025-05-07T20:32:52.6963394Z if compiled: 2025-05-07T20:32:52.6963679Z op = torch.compile(op) 2025-05-07T20:32:52.6964013Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.6964320Z 2025-05-07T20:32:52.6964544Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.6964735Z 2025-05-07T20:32:52.6964854Z moe/activation_test.py:117: 2025-05-07T20:32:52.6965187Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.6965572Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.6965905Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.6966542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.6967182Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.6967926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.6968704Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.6969305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.6970073Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.6970821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.6971420Z kernel = self.compile( 2025-05-07T20:32:52.6972026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.6972769Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.6973226Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.6973486Z 2025-05-07T20:32:52.6973723Z self = 2025-05-07T20:32:52.6974936Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.6976522Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf71a97e0>} 2025-05-07T20:32:52.6978029Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.6979183Z context = 2025-05-07T20:32:52.6979509Z 2025-05-07T20:32:52.6979786Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.6980377Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.6980911Z module_map=module_map) 2025-05-07T20:32:52.6981326Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.6981725Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.6982025Z E ^ 2025-05-07T20:32:52.6982552Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.6983058Z 2025-05-07T20:32:52.6983525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.6984182Z 2025-05-07T20:32:52.9044283Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.9044941Z self=, 2025-05-07T20:32:52.9045558Z T=16384, 2025-05-07T20:32:52.9045969Z D=5120, 2025-05-07T20:32:52.9046289Z scale_ub=None, 2025-05-07T20:32:52.9046625Z contiguous=False, 2025-05-07T20:32:52.9046985Z compiled=True, 2025-05-07T20:32:52.9047402Z ) 2025-05-07T20:32:52.9047767Z self = 2025-05-07T20:32:52.9048334Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:52.9048656Z 2025-05-07T20:32:52.9048750Z @given( 2025-05-07T20:32:52.9049021Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.9049379Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.9049736Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.9050127Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.9050508Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.9050841Z ) 2025-05-07T20:32:52.9051253Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.9051755Z def test_silu_mul_quant( 2025-05-07T20:32:52.9052039Z self, 2025-05-07T20:32:52.9052267Z T: int, 2025-05-07T20:32:52.9052503Z D: int, 2025-05-07T20:32:52.9052761Z scale_ub: Optional[float], 2025-05-07T20:32:52.9053081Z contiguous: bool, 2025-05-07T20:32:52.9053367Z compiled: bool, 2025-05-07T20:32:52.9053624Z ) -> None: 2025-05-07T20:32:52.9053877Z torch.manual_seed(2025) 2025-05-07T20:32:52.9054160Z 2025-05-07T20:32:52.9054472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.9054864Z 2025-05-07T20:32:52.9055092Z x_sign = torch.sign(x) 2025-05-07T20:32:52.9055429Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.9055841Z x = x_sign * x_clamp 2025-05-07T20:32:52.9056124Z x0 = x[:, :D] 2025-05-07T20:32:52.9056375Z x1 = x[:, D:] 2025-05-07T20:32:52.9056624Z 2025-05-07T20:32:52.9056849Z if contiguous: 2025-05-07T20:32:52.9057117Z x0 = x0.contiguous() 2025-05-07T20:32:52.9057423Z x1 = x1.contiguous() 2025-05-07T20:32:52.9057707Z 2025-05-07T20:32:52.9057930Z if scale_ub is not None: 2025-05-07T20:32:52.9058249Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.9058640Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.9058995Z ) 2025-05-07T20:32:52.9059225Z else: 2025-05-07T20:32:52.9059474Z scale_ub_tensor = None 2025-05-07T20:32:52.9059770Z 2025-05-07T20:32:52.9060038Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.9060402Z op = silu_mul_quant 2025-05-07T20:32:52.9060695Z if compiled: 2025-05-07T20:32:52.9060982Z op = torch.compile(op) 2025-05-07T20:32:52.9061326Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.9061643Z 2025-05-07T20:32:52.9062001Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.9062200Z 2025-05-07T20:32:52.9062316Z moe/activation_test.py:117: 2025-05-07T20:32:52.9062664Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.9063045Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.9063372Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.9064011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.9064652Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.9065401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.9066185Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.9066868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.9067643Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.9068401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.9069011Z kernel = self.compile( 2025-05-07T20:32:52.9069684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.9070429Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.9070887Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.9071149Z 2025-05-07T20:32:52.9071392Z self = 2025-05-07T20:32:52.9072615Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.9074262Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf71aa680>} 2025-05-07T20:32:52.9075789Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.9076955Z context = 2025-05-07T20:32:52.9077283Z 2025-05-07T20:32:52.9077481Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.9078067Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.9078607Z module_map=module_map) 2025-05-07T20:32:52.9079031Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.9079436Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.9079736Z E ^ 2025-05-07T20:32:52.9080269Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.9080780Z 2025-05-07T20:32:52.9081269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.9081856Z 2025-05-07T20:32:52.9081985Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.9082455Z self=, 2025-05-07T20:32:52.9082918Z T=2048, 2025-05-07T20:32:52.9083142Z D=5120, 2025-05-07T20:32:52.9083365Z scale_ub=None, 2025-05-07T20:32:52.9083624Z contiguous=False, 2025-05-07T20:32:52.9083895Z compiled=True, 2025-05-07T20:32:52.9084128Z ) 2025-05-07T20:32:53.0219930Z self = 2025-05-07T20:32:53.0220588Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:53.0221259Z 2025-05-07T20:32:53.0221417Z @given( 2025-05-07T20:32:53.0221830Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.0222335Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.0222707Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.0223094Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.0223479Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.0224106Z ) 2025-05-07T20:32:53.0224519Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.0225029Z def test_silu_mul_quant( 2025-05-07T20:32:53.0225306Z self, 2025-05-07T20:32:53.0225559Z T: int, 2025-05-07T20:32:53.0225819Z D: int, 2025-05-07T20:32:53.0226163Z scale_ub: Optional[float], 2025-05-07T20:32:53.0226480Z contiguous: bool, 2025-05-07T20:32:53.0226764Z compiled: bool, 2025-05-07T20:32:53.0227021Z ) -> None: 2025-05-07T20:32:53.0227285Z torch.manual_seed(2025) 2025-05-07T20:32:53.0227571Z 2025-05-07T20:32:53.0227884Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.0228282Z 2025-05-07T20:32:53.0228590Z x_sign = torch.sign(x) 2025-05-07T20:32:53.0228929Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.0229284Z x = x_sign * x_clamp 2025-05-07T20:32:53.0229566Z x0 = x[:, :D] 2025-05-07T20:32:53.0229824Z x1 = x[:, D:] 2025-05-07T20:32:53.0230065Z 2025-05-07T20:32:53.0230286Z if contiguous: 2025-05-07T20:32:53.0230563Z x0 = x0.contiguous() 2025-05-07T20:32:53.0230860Z x1 = x1.contiguous() 2025-05-07T20:32:53.0231143Z 2025-05-07T20:32:53.0231377Z if scale_ub is not None: 2025-05-07T20:32:53.0231700Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.0232096Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.0232463Z ) 2025-05-07T20:32:53.0232689Z else: 2025-05-07T20:32:53.0232941Z scale_ub_tensor = None 2025-05-07T20:32:53.0233237Z 2025-05-07T20:32:53.0233586Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.0233959Z op = silu_mul_quant 2025-05-07T20:32:53.0234259Z if compiled: 2025-05-07T20:32:53.0234556Z op = torch.compile(op) 2025-05-07T20:32:53.0234900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.0235223Z 2025-05-07T20:32:53.0235457Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.0235650Z 2025-05-07T20:32:53.0235768Z moe/activation_test.py:117: 2025-05-07T20:32:53.0236118Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.0236509Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.0236835Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.0237486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:53.0238132Z return fn(*args, **kwargs) 
2025-05-07T20:32:53.0238896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.0239684Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.0240304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.0241090Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.0241843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.0242454Z kernel = self.compile( 2025-05-07T20:32:53.0243083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.0243843Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.0244436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.0244710Z 2025-05-07T20:32:53.0244949Z self = 2025-05-07T20:32:53.0246181Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.0247753Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf71aa560>} 2025-05-07T20:32:53.0249281Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.0250746Z context = 2025-05-07T20:32:53.0251135Z 2025-05-07T20:32:53.0251344Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.0252036Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.0252695Z module_map=module_map) 2025-05-07T20:32:53.0253163Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.0253613Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.0253936Z E ^ 2025-05-07T20:32:53.0254544Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis then tried the following nine examples. Every one of them failed with this same CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"), raised from the same kernel launch in fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (reached via torch/_dynamo/eval_frame.py:678 when compiled=True), with source listings and tracebacks identical to the one above; only the example parameters differ:

2025-05-07T20:32:53.0256624Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:53.2373174Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:53.2410228Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:53.5595958Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:53.7104040Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:53.7180580Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:53.9301910Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:53.9363830Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:54.0534548Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
From this point the tried examples fail with CUDA out-of-memory errors while building their inputs. The @given block and test body are unchanged from the listing above, so only each example's failing statement and error message are shown:

2025-05-07T20:32:54.1391482Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError

2025-05-07T20:32:54.1427927Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError

2025-05-07T20:32:54.1453139Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:92: OutOfMemoryError

2025-05-07T20:32:54.1476674Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.1500843Z 2025-05-07T20:32:54.1501064Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.1501462Z 2025-05-07T20:32:54.1501667Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.1502440Z self=, 2025-05-07T20:32:54.1503206Z T=2048, 2025-05-07T20:32:54.1503557Z D=7168, 2025-05-07T20:32:54.1503903Z scale_ub=None, 2025-05-07T20:32:54.1504301Z contiguous=True, 2025-05-07T20:32:54.1504717Z compiled=False, 2025-05-07T20:32:54.1505092Z ) 2025-05-07T20:32:54.2858448Z self = 2025-05-07T20:32:54.2859442Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.2859953Z 2025-05-07T20:32:54.2860105Z @given( 2025-05-07T20:32:54.2860528Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2861117Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2861682Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2862688Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2863269Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2863768Z ) 2025-05-07T20:32:54.2864374Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2865177Z def test_silu_mul_quant( 2025-05-07T20:32:54.2865622Z self, 2025-05-07T20:32:54.2865982Z T: int, 2025-05-07T20:32:54.2866334Z D: int, 2025-05-07T20:32:54.2866727Z scale_ub: Optional[float], 2025-05-07T20:32:54.2867220Z contiguous: bool, 2025-05-07T20:32:54.2867679Z compiled: bool, 2025-05-07T20:32:54.2868098Z ) -> None: 2025-05-07T20:32:54.2868479Z torch.manual_seed(2025) 2025-05-07T20:32:54.2868922Z 2025-05-07T20:32:54.2869423Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2870232Z 2025-05-07T20:32:54.2870585Z > x_sign = torch.sign(x) 2025-05-07T20:32:54.2874500Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2878371Z 2025-05-07T20:32:54.2878607Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:54.2879026Z 2025-05-07T20:32:54.2879219Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2880016Z self=, 2025-05-07T20:32:54.2880797Z T=1, 2025-05-07T20:32:54.2881138Z D=7168, 2025-05-07T20:32:54.2881510Z scale_ub=1200.0, 2025-05-07T20:32:54.2881919Z contiguous=True, 2025-05-07T20:32:54.2882335Z compiled=False, 2025-05-07T20:32:54.2882722Z ) 2025-05-07T20:32:54.2883315Z self = 2025-05-07T20:32:54.2884215Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.2884728Z 2025-05-07T20:32:54.2884873Z @given( 2025-05-07T20:32:54.2885306Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2885880Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2886461Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2887092Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2887718Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2888254Z ) 2025-05-07T20:32:54.2888917Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2889790Z def test_silu_mul_quant( 2025-05-07T20:32:54.2890239Z self, 2025-05-07T20:32:54.2890597Z T: int, 2025-05-07T20:32:54.2890981Z D: int, 2025-05-07T20:32:54.2891380Z scale_ub: Optional[float], 2025-05-07T20:32:54.2891897Z contiguous: bool, 2025-05-07T20:32:54.2892346Z compiled: bool, 2025-05-07T20:32:54.2892751Z ) -> None: 2025-05-07T20:32:54.2893157Z torch.manual_seed(2025) 2025-05-07T20:32:54.2893606Z 2025-05-07T20:32:54.2894091Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2894716Z 2025-05-07T20:32:54.2895044Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2895570Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2896142Z x = x_sign * x_clamp 2025-05-07T20:32:54.2896627Z x0 = x[:, :D] 2025-05-07T20:32:54.2897040Z x1 = x[:, D:] 2025-05-07T20:32:54.2897421Z 2025-05-07T20:32:54.2897765Z if contiguous: 2025-05-07T20:32:54.2898181Z x0 = x0.contiguous() 2025-05-07T20:32:54.2898646Z x1 = x1.contiguous() 2025-05-07T20:32:54.2899233Z 2025-05-07T20:32:54.2899597Z if scale_ub is not None: 2025-05-07T20:32:54.2900098Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2900719Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2901298Z ) 2025-05-07T20:32:54.2901670Z else: 2025-05-07T20:32:54.2902056Z scale_ub_tensor = None 2025-05-07T20:32:54.2902514Z 2025-05-07T20:32:54.2902976Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2903552Z op = silu_mul_quant 2025-05-07T20:32:54.2904005Z if compiled: 2025-05-07T20:32:54.2904430Z op = torch.compile(op) 2025-05-07T20:32:54.2904975Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2905499Z 2025-05-07T20:32:54.2905968Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2906294Z 2025-05-07T20:32:54.2906477Z moe/activation_test.py:117: 2025-05-07T20:32:54.2907025Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2907635Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2908162Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2909477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2910892Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2911896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2913211Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2914543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2915557Z kernel = self.compile( 2025-05-07T20:32:54.2916642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2917909Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2918657Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2919096Z 2025-05-07T20:32:54.2919483Z self = 2025-05-07T20:32:54.2921524Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2924551Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf6e32cb0>} 2025-05-07T20:32:54.2927160Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2929119Z context = 2025-05-07T20:32:54.2929673Z 2025-05-07T20:32:54.2929980Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2930974Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2931891Z module_map=module_map) 2025-05-07T20:32:54.2932561Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2933234Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2933727Z E ^ 2025-05-07T20:32:54.2934626Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2935506Z 2025-05-07T20:32:54.2936354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2937355Z 2025-05-07T20:32:54.2937784Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2938579Z self=, 2025-05-07T20:32:54.2939340Z T=128, 2025-05-07T20:32:54.2939679Z D=5120, 2025-05-07T20:32:54.2940042Z scale_ub=None, 2025-05-07T20:32:54.2940443Z contiguous=True, 2025-05-07T20:32:54.2940856Z compiled=False, 2025-05-07T20:32:54.2941239Z ) 2025-05-07T20:32:54.3774797Z self = 2025-05-07T20:32:54.3775791Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.3776263Z 2025-05-07T20:32:54.3776425Z @given( 2025-05-07T20:32:54.3776833Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.3777385Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.3778276Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.3778901Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.3779534Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.3780074Z ) 2025-05-07T20:32:54.3780748Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.3781589Z def test_silu_mul_quant( 2025-05-07T20:32:54.3782221Z self, 2025-05-07T20:32:54.3782587Z T: int, 2025-05-07T20:32:54.3782947Z D: int, 2025-05-07T20:32:54.3783356Z scale_ub: Optional[float], 2025-05-07T20:32:54.3783864Z contiguous: bool, 2025-05-07T20:32:54.3784305Z compiled: bool, 2025-05-07T20:32:54.3784724Z ) -> None: 2025-05-07T20:32:54.3785132Z torch.manual_seed(2025) 2025-05-07T20:32:54.3785591Z 2025-05-07T20:32:54.3786103Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.3786773Z 2025-05-07T20:32:54.3787128Z x_sign = torch.sign(x) 2025-05-07T20:32:54.3787679Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.3788283Z x = x_sign * x_clamp 2025-05-07T20:32:54.3788747Z x0 = x[:, :D] 2025-05-07T20:32:54.3789151Z x1 = x[:, D:] 2025-05-07T20:32:54.3789558Z 2025-05-07T20:32:54.3789914Z if contiguous: 2025-05-07T20:32:54.3790356Z x0 = x0.contiguous() 2025-05-07T20:32:54.3790856Z x1 = x1.contiguous() 2025-05-07T20:32:54.3791322Z 2025-05-07T20:32:54.3791679Z if scale_ub is not None: 2025-05-07T20:32:54.3792208Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.3792851Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.3793441Z ) 2025-05-07T20:32:54.3793976Z else: 2025-05-07T20:32:54.3794386Z scale_ub_tensor = None 2025-05-07T20:32:54.3794862Z 2025-05-07T20:32:54.3795289Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.3795888Z op = silu_mul_quant 2025-05-07T20:32:54.3796354Z if compiled: 2025-05-07T20:32:54.3796822Z op = torch.compile(op) 2025-05-07T20:32:54.3797379Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.3797895Z 2025-05-07T20:32:54.3798252Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.3798579Z 2025-05-07T20:32:54.3798772Z moe/activation_test.py:117: 2025-05-07T20:32:54.3799336Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.3799981Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.3800524Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.3801863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.3803237Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.3804288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.3805647Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.3807231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.3808228Z kernel = self.compile( 2025-05-07T20:32:54.3809277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.3810554Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.3811309Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.3811740Z 2025-05-07T20:32:54.3812116Z self = 2025-05-07T20:32:54.3814193Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.3817041Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf6e33640>} 2025-05-07T20:32:54.3819661Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.3821736Z context = 2025-05-07T20:32:54.3822305Z 2025-05-07T20:32:54.3822614Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.3823605Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.3824897Z module_map=module_map) 2025-05-07T20:32:54.3825576Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.3826231Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.3826716Z E ^ 2025-05-07T20:32:54.3827599Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.3828497Z 2025-05-07T20:32:54.3829307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.3830321Z 2025-05-07T20:32:54.3830514Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.3831300Z self=, 2025-05-07T20:32:54.3832058Z T=128, 2025-05-07T20:32:54.3832412Z D=7168, 2025-05-07T20:32:54.3832763Z scale_ub=None, 2025-05-07T20:32:54.3833154Z contiguous=True, 2025-05-07T20:32:54.3833635Z compiled=False, 2025-05-07T20:32:54.3834031Z ) 2025-05-07T20:32:54.3834624Z self = 2025-05-07T20:32:54.3835558Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.3836078Z 2025-05-07T20:32:54.3836223Z @given( 2025-05-07T20:32:54.3836656Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.3837240Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.3837818Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.3838448Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.3839061Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.3839593Z ) 2025-05-07T20:32:54.3840256Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.3841092Z def test_silu_mul_quant( 2025-05-07T20:32:54.3841552Z self, 2025-05-07T20:32:54.3841912Z T: int, 2025-05-07T20:32:54.3842272Z D: int, 2025-05-07T20:32:54.3842681Z scale_ub: Optional[float], 2025-05-07T20:32:54.3843193Z contiguous: bool, 2025-05-07T20:32:54.3843638Z compiled: bool, 2025-05-07T20:32:54.3844041Z ) -> None: 2025-05-07T20:32:54.3844440Z torch.manual_seed(2025) 2025-05-07T20:32:54.3845110Z 2025-05-07T20:32:54.3845620Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.3846268Z 2025-05-07T20:32:54.3846633Z x_sign = torch.sign(x) 2025-05-07T20:32:54.3847173Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.3847764Z x = x_sign * x_clamp 2025-05-07T20:32:54.3848221Z x0 = x[:, :D] 2025-05-07T20:32:54.3848623Z x1 = x[:, D:] 2025-05-07T20:32:54.3849012Z 2025-05-07T20:32:54.3849360Z if contiguous: 2025-05-07T20:32:54.3849792Z x0 = x0.contiguous() 2025-05-07T20:32:54.3850282Z x1 = x1.contiguous() 2025-05-07T20:32:54.3850742Z 2025-05-07T20:32:54.3851096Z if scale_ub is not None: 2025-05-07T20:32:54.3851616Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.3852380Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.3852965Z ) 2025-05-07T20:32:54.3853319Z else: 2025-05-07T20:32:54.3853727Z scale_ub_tensor = None 2025-05-07T20:32:54.3854205Z 2025-05-07T20:32:54.3854634Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.3855240Z op = silu_mul_quant 2025-05-07T20:32:54.3855841Z if compiled: 2025-05-07T20:32:54.3856305Z op = torch.compile(op) 2025-05-07T20:32:54.3856924Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.3857454Z 2025-05-07T20:32:54.3857808Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.3858132Z 2025-05-07T20:32:54.3858317Z moe/activation_test.py:117: 2025-05-07T20:32:54.3858880Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.3859504Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.3860038Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.3861710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.3863220Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.3864326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.3865908Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.3878190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.3879288Z kernel = self.compile( 2025-05-07T20:32:54.3880350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.3881633Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.3882403Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.3882857Z 2025-05-07T20:32:54.3883246Z self = 2025-05-07T20:32:54.3885380Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.3888117Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf6ab4160>} 2025-05-07T20:32:54.3890763Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.3892674Z context = 2025-05-07T20:32:54.3893244Z 2025-05-07T20:32:54.3893549Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.3894684Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.3895581Z module_map=module_map) 2025-05-07T20:32:54.3896248Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.3896952Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.3897443Z E ^ 2025-05-07T20:32:54.3898355Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.3899272Z 2025-05-07T20:32:54.3900095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.3901127Z 2025-05-07T20:32:54.3901323Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.3902134Z self=, 2025-05-07T20:32:54.3902990Z T=2048, 2025-05-07T20:32:54.3903351Z D=7168, 2025-05-07T20:32:54.3903721Z scale_ub=1200.0, 2025-05-07T20:32:54.3904139Z contiguous=True, 2025-05-07T20:32:54.3904576Z compiled=False, 2025-05-07T20:32:54.3904973Z ) 2025-05-07T20:32:54.4895122Z self = 2025-05-07T20:32:54.4896075Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.4896869Z 2025-05-07T20:32:54.4897009Z @given( 2025-05-07T20:32:54.4897439Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.4897992Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.4898546Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.4899145Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.4899740Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.4900261Z ) 2025-05-07T20:32:54.4900884Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.4901721Z def test_silu_mul_quant( 2025-05-07T20:32:54.4902156Z self, 2025-05-07T20:32:54.4902498Z T: int, 2025-05-07T20:32:54.4902861Z D: int, 2025-05-07T20:32:54.4903253Z scale_ub: Optional[float], 2025-05-07T20:32:54.4903728Z contiguous: bool, 2025-05-07T20:32:54.4904153Z compiled: bool, 2025-05-07T20:32:54.4904562Z ) -> None: 2025-05-07T20:32:54.4904944Z torch.manual_seed(2025) 2025-05-07T20:32:54.4905378Z 2025-05-07T20:32:54.4905873Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.4909803Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.4913403Z 2025-05-07T20:32:54.4913766Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.4914176Z 2025-05-07T20:32:54.4914361Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.4915113Z self=, 2025-05-07T20:32:54.4915850Z T=1, 2025-05-07T20:32:54.4916177Z D=5120, 2025-05-07T20:32:54.4916515Z scale_ub=1200.0, 2025-05-07T20:32:54.4916905Z contiguous=True, 2025-05-07T20:32:54.4917310Z compiled=False, 2025-05-07T20:32:54.4917675Z ) 2025-05-07T20:32:54.4918250Z self = 2025-05-07T20:32:54.4919138Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.4919633Z 2025-05-07T20:32:54.4919773Z @given( 2025-05-07T20:32:54.4920182Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.4921011Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.4921585Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.4922212Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.4922821Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.4923366Z ) 2025-05-07T20:32:54.4924366Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.4925196Z def test_silu_mul_quant( 2025-05-07T20:32:54.4925641Z self, 2025-05-07T20:32:54.4925984Z T: int, 2025-05-07T20:32:54.4926335Z D: int, 2025-05-07T20:32:54.4926722Z scale_ub: Optional[float], 2025-05-07T20:32:54.4927188Z contiguous: bool, 2025-05-07T20:32:54.4927623Z compiled: bool, 2025-05-07T20:32:54.4928021Z ) -> None: 2025-05-07T20:32:54.4928623Z torch.manual_seed(2025) 2025-05-07T20:32:54.4929067Z 2025-05-07T20:32:54.4929554Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.4930158Z 2025-05-07T20:32:54.4930497Z x_sign = torch.sign(x) 2025-05-07T20:32:54.4931011Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.4931558Z x = x_sign * x_clamp 2025-05-07T20:32:54.4932108Z x0 = x[:, :D] 2025-05-07T20:32:54.4932492Z x1 = x[:, D:] 2025-05-07T20:32:54.4932862Z 2025-05-07T20:32:54.4933188Z if contiguous: 2025-05-07T20:32:54.4933599Z x0 = x0.contiguous() 2025-05-07T20:32:54.4934066Z x1 = x1.contiguous() 2025-05-07T20:32:54.4934491Z 2025-05-07T20:32:54.4934832Z if scale_ub is not None: 2025-05-07T20:32:54.4935328Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.4935933Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.4936496Z ) 2025-05-07T20:32:54.4936842Z else: 2025-05-07T20:32:54.4937206Z scale_ub_tensor = None 2025-05-07T20:32:54.4937650Z 2025-05-07T20:32:54.4938066Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.4938637Z op = silu_mul_quant 2025-05-07T20:32:54.4939085Z if compiled: 2025-05-07T20:32:54.4939529Z op = torch.compile(op) 2025-05-07T20:32:54.4940063Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4940557Z 2025-05-07T20:32:54.4940900Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.4941196Z 2025-05-07T20:32:54.4941381Z moe/activation_test.py:117: 2025-05-07T20:32:54.4941898Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4942506Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.4943017Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4944256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.4945533Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.4946520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.4947779Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.4948988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.4950006Z kernel = self.compile( 2025-05-07T20:32:54.4950995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.4952186Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.4952911Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4953338Z 2025-05-07T20:32:54.4953828Z self = 2025-05-07T20:32:54.4956023Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.4958581Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf6ab4940>} 2025-05-07T20:32:54.4961079Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.4962971Z context = 2025-05-07T20:32:54.4963501Z 2025-05-07T20:32:54.4963810Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.4964899Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.4965755Z module_map=module_map) 2025-05-07T20:32:54.4966418Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.4967050Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.4967511Z E ^ 2025-05-07T20:32:54.4968374Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.4969295Z 2025-05-07T20:32:54.4970077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.4971042Z 2025-05-07T20:32:54.4971242Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.4971990Z self=, 2025-05-07T20:32:54.4972734Z T=2048, 2025-05-07T20:32:54.4973074Z D=5120, 2025-05-07T20:32:54.4973416Z scale_ub=None, 2025-05-07T20:32:54.4973808Z contiguous=True, 2025-05-07T20:32:54.4974217Z compiled=False, 2025-05-07T20:32:54.4974578Z ) 2025-05-07T20:32:54.4975163Z self = 2025-05-07T20:32:54.4976066Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.4976564Z 2025-05-07T20:32:54.4976714Z @given( 2025-05-07T20:32:54.4977124Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.4977702Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.4978256Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.4978847Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.4979439Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.4979958Z ) 2025-05-07T20:32:54.4980592Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.4981408Z def test_silu_mul_quant( 2025-05-07T20:32:54.4981858Z self, 2025-05-07T20:32:54.4982206Z T: int, 2025-05-07T20:32:54.4982570Z D: int, 2025-05-07T20:32:54.4982960Z scale_ub: Optional[float], 2025-05-07T20:32:54.4983433Z contiguous: bool, 2025-05-07T20:32:54.4983860Z compiled: bool, 2025-05-07T20:32:54.4984266Z ) -> None: 2025-05-07T20:32:54.4984652Z torch.manual_seed(2025) 2025-05-07T20:32:54.4985085Z 2025-05-07T20:32:54.4985581Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.4986210Z 2025-05-07T20:32:54.4986560Z > x_sign = torch.sign(x) 2025-05-07T20:32:54.4990391Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
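Interleaved with the OOMs, every example that actually reaches the Triton kernel fails to compile instead: fp8e4nv is the OCP FP8 E4M3 format, which Triton only lowers natively on compute capability 8.9+ (Ada/Hopper), while the A10G in a g5 runner is SM 8.6 and therefore only offers fp8e4b15 and fp8e5. A guard along these lines (a hedged sketch, not the suite's actual skip logic) would skip the kernel on unsupported parts:

# Hypothetical guard sketch: fp8e4nv (FP8 E4M3) needs compute capability
# >= 8.9; an A10G (SM 8.6) rejects it, which matches the error above.
import pytest
import torch

def supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)

requires_fp8 = pytest.mark.skipif(
    not supports_fp8e4nv(), reason="fp8e4nv requires SM 8.9+ (Ada/Hopper)"
)

Applied as @requires_fp8 on the test, this would turn the CompilationError cascade into clean skips on this runner class.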
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.4994045Z 2025-05-07T20:32:54.4994263Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:54.4994652Z 2025-05-07T20:32:54.4994846Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.4995603Z self=, 2025-05-07T20:32:54.4996356Z T=16384, 2025-05-07T20:32:54.4996718Z D=5120, 2025-05-07T20:32:54.4997062Z scale_ub=None, 2025-05-07T20:32:54.4997442Z contiguous=True, 2025-05-07T20:32:54.4997843Z compiled=False, 2025-05-07T20:32:54.4998216Z ) 2025-05-07T20:32:54.5980465Z self = 2025-05-07T20:32:54.5981428Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.5981914Z 2025-05-07T20:32:54.5982353Z @given( 2025-05-07T20:32:54.5982761Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5983323Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5983882Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5984487Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5985094Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5985760Z ) 2025-05-07T20:32:54.5986398Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5987224Z def test_silu_mul_quant( 2025-05-07T20:32:54.5987663Z self, 2025-05-07T20:32:54.5988010Z T: int, 2025-05-07T20:32:54.5988358Z D: int, 2025-05-07T20:32:54.5988748Z scale_ub: Optional[float], 2025-05-07T20:32:54.5989231Z contiguous: bool, 2025-05-07T20:32:54.5989655Z compiled: bool, 2025-05-07T20:32:54.5990057Z ) -> None: 2025-05-07T20:32:54.5990452Z torch.manual_seed(2025) 2025-05-07T20:32:54.5990897Z 2025-05-07T20:32:54.5991391Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5995425Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
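Hypothesis itself is behaving as configured here: Verbosity.verbose prints a "Trying example:" header for every generated case, and deadline=None removes the per-example time budget (useful when CUDA kernel compilation is slow). A self-contained sketch of those knobs — max_examples=16 is a stand-in, since the real _MAX_SAMPLES value is not shown in this log:

# Hypothetical sketch of the Hypothesis settings used by the test above.
from hypothesis import Verbosity, given, settings
from hypothesis import strategies as st

@settings(verbosity=Verbosity.verbose, max_examples=16, deadline=None)
@given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
def test_shapes(T: int) -> None:
    assert T > 0  # every drawn example is echoed to the log at this verbosity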
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5999027Z 2025-05-07T20:32:54.5999259Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5999644Z 2025-05-07T20:32:54.5999837Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.6000577Z self=, 2025-05-07T20:32:54.6001309Z T=4096, 2025-05-07T20:32:54.6001644Z D=5120, 2025-05-07T20:32:54.6001975Z scale_ub=None, 2025-05-07T20:32:54.6002368Z contiguous=True, 2025-05-07T20:32:54.6002771Z compiled=False, 2025-05-07T20:32:54.6003139Z ) 2025-05-07T20:32:54.6003715Z self = 2025-05-07T20:32:54.6004636Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.6005139Z 2025-05-07T20:32:54.6005278Z @given( 2025-05-07T20:32:54.6005693Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.6006317Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.6006885Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.6007496Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.6008098Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.6008640Z ) 2025-05-07T20:32:54.6009294Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.6010136Z def test_silu_mul_quant( 2025-05-07T20:32:54.6010588Z self, 2025-05-07T20:32:54.6011167Z T: int, 2025-05-07T20:32:54.6011529Z D: int, 2025-05-07T20:32:54.6011922Z scale_ub: Optional[float], 2025-05-07T20:32:54.6012394Z contiguous: bool, 2025-05-07T20:32:54.6012827Z compiled: bool, 2025-05-07T20:32:54.6013229Z ) -> None: 2025-05-07T20:32:54.6013611Z torch.manual_seed(2025) 2025-05-07T20:32:54.6014054Z 2025-05-07T20:32:54.6014547Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.6018194Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.6021720Z 2025-05-07T20:32:54.6021952Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.6022337Z 2025-05-07T20:32:54.6022603Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.6023372Z self=, 2025-05-07T20:32:54.6024359Z T=2048, 2025-05-07T20:32:54.6024695Z D=5120, 2025-05-07T20:32:54.6025035Z scale_ub=None, 2025-05-07T20:32:54.6025429Z contiguous=False, 2025-05-07T20:32:54.6025823Z compiled=False, 2025-05-07T20:32:54.6026191Z ) 2025-05-07T20:32:54.6026755Z self = 2025-05-07T20:32:54.6027649Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.6028152Z 2025-05-07T20:32:54.6028289Z @given( 2025-05-07T20:32:54.6028696Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.6029272Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.6029827Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.6030442Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.6031047Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.6031548Z ) 2025-05-07T20:32:54.6032182Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.6032992Z def test_silu_mul_quant( 2025-05-07T20:32:54.6033426Z self, 2025-05-07T20:32:54.6033860Z T: int, 2025-05-07T20:32:54.6034209Z D: int, 2025-05-07T20:32:54.6034594Z scale_ub: Optional[float], 2025-05-07T20:32:54.6035076Z contiguous: bool, 2025-05-07T20:32:54.6035504Z compiled: bool, 2025-05-07T20:32:54.6035906Z ) -> None: 2025-05-07T20:32:54.6036330Z torch.manual_seed(2025) 2025-05-07T20:32:54.6036769Z 2025-05-07T20:32:54.6037261Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.6041030Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.6044515Z 2025-05-07T20:32:54.6044734Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.6045130Z 2025-05-07T20:32:54.6045315Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.6046060Z self=, 2025-05-07T20:32:54.6046793Z T=4096, 2025-05-07T20:32:54.6047116Z D=7168, 2025-05-07T20:32:54.6047664Z scale_ub=None, 2025-05-07T20:32:54.6048065Z contiguous=True, 2025-05-07T20:32:54.6048451Z compiled=True, 2025-05-07T20:32:54.6048812Z ) 2025-05-07T20:32:54.6049379Z self = 2025-05-07T20:32:54.6050257Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.6050757Z 2025-05-07T20:32:54.6050895Z @given( 2025-05-07T20:32:54.6051302Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.6051852Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.6052395Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.6052994Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.6053587Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.6054216Z ) 2025-05-07T20:32:54.6054854Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.6055677Z def test_silu_mul_quant( 2025-05-07T20:32:54.6056105Z self, 2025-05-07T20:32:54.6056455Z T: int, 2025-05-07T20:32:54.6056827Z D: int, 2025-05-07T20:32:54.6057253Z scale_ub: Optional[float], 2025-05-07T20:32:54.6057884Z contiguous: bool, 2025-05-07T20:32:54.6058326Z compiled: bool, 2025-05-07T20:32:54.6058723Z ) -> None: 2025-05-07T20:32:54.6059116Z torch.manual_seed(2025) 2025-05-07T20:32:54.6059568Z 2025-05-07T20:32:54.6060052Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.6063967Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
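Note the monotonic creep in "allocated by PyTorch" across examples (21.61 GiB, then 21.67 GiB, then 21.73 GiB): memory from earlier examples is evidently still held when the next one starts. If those references really are dead, a cleanup pass between examples would return the cached blocks — a sketch under that assumption; it cannot free tensors that are still alive:

# Hypothetical per-example cleanup sketch for the growth seen above.
import gc
import torch

def release_cuda_memory() -> None:
    gc.collect()              # drop dead Python references first
    torch.cuda.empty_cache()  # then return cached blocks to the driver
    torch.cuda.synchronize()  # ensure pending frees have actually landed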
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.6067489Z 2025-05-07T20:32:54.6067711Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.6068111Z 2025-05-07T20:32:54.6068297Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.6069055Z self=, 2025-05-07T20:32:54.6069785Z T=2048, 2025-05-07T20:32:54.6070131Z D=5120, 2025-05-07T20:32:54.6070488Z scale_ub=1200.0, 2025-05-07T20:32:54.6070873Z contiguous=False, 2025-05-07T20:32:54.6071278Z compiled=False, 2025-05-07T20:32:54.6071631Z ) 2025-05-07T20:32:54.6072204Z self = 2025-05-07T20:32:54.6073115Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.6073724Z 2025-05-07T20:32:54.6073866Z @given( 2025-05-07T20:32:54.6074293Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.6074858Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.6075421Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.6076032Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.6076678Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.6077205Z ) 2025-05-07T20:32:54.6077851Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.6078677Z def test_silu_mul_quant( 2025-05-07T20:32:54.6079109Z self, 2025-05-07T20:32:54.6079462Z T: int, 2025-05-07T20:32:54.6079819Z D: int, 2025-05-07T20:32:54.6080209Z scale_ub: Optional[float], 2025-05-07T20:32:54.6080703Z contiguous: bool, 2025-05-07T20:32:54.6081149Z compiled: bool, 2025-05-07T20:32:54.6081545Z ) -> None: 2025-05-07T20:32:54.6081929Z torch.manual_seed(2025) 2025-05-07T20:32:54.6082372Z 2025-05-07T20:32:54.6082994Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.6086942Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.6090516Z 2025-05-07T20:32:54.6090730Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.6091199Z 2025-05-07T20:32:54.6091388Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.6092151Z self=, 2025-05-07T20:32:54.6092891Z T=4096, 2025-05-07T20:32:54.6093231Z D=7168, 2025-05-07T20:32:54.6093581Z scale_ub=1200.0, 2025-05-07T20:32:54.6093961Z contiguous=True, 2025-05-07T20:32:54.6094364Z compiled=False, 2025-05-07T20:32:54.6094811Z ) 2025-05-07T20:32:54.7449452Z self = 2025-05-07T20:32:54.7450441Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.7450946Z 2025-05-07T20:32:54.7451095Z @given( 2025-05-07T20:32:54.7451521Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.7452092Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.7452653Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.7453251Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.7453837Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.7454329Z ) 2025-05-07T20:32:54.7454939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.7455759Z def test_silu_mul_quant( 2025-05-07T20:32:54.7456179Z self, 2025-05-07T20:32:54.7456517Z T: int, 2025-05-07T20:32:54.7456902Z D: int, 2025-05-07T20:32:54.7457352Z scale_ub: Optional[float], 2025-05-07T20:32:54.7457867Z contiguous: bool, 2025-05-07T20:32:54.7458324Z compiled: bool, 2025-05-07T20:32:54.7458752Z ) -> None: 2025-05-07T20:32:54.7459155Z torch.manual_seed(2025) 2025-05-07T20:32:54.7459619Z 2025-05-07T20:32:54.7460130Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.7464109Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.7467877Z 2025-05-07T20:32:54.7468119Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.7468525Z 2025-05-07T20:32:54.7468721Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.7469522Z self=, 2025-05-07T20:32:54.7470307Z T=16384, 2025-05-07T20:32:54.7470670Z D=7168, 2025-05-07T20:32:54.7471032Z scale_ub=None, 2025-05-07T20:32:54.7471446Z contiguous=False, 2025-05-07T20:32:54.7471865Z compiled=True, 2025-05-07T20:32:54.7472264Z ) 2025-05-07T20:32:54.7472884Z self = 2025-05-07T20:32:54.7474008Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.7475041Z 2025-05-07T20:32:54.7475202Z @given( 2025-05-07T20:32:54.7475641Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.7476243Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.7476831Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.7477511Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.7478154Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.7478703Z ) 2025-05-07T20:32:54.7479381Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.7480274Z def test_silu_mul_quant( 2025-05-07T20:32:54.7480737Z self, 2025-05-07T20:32:54.7481101Z T: int, 2025-05-07T20:32:54.7481476Z D: int, 2025-05-07T20:32:54.7482124Z scale_ub: Optional[float], 2025-05-07T20:32:54.7482725Z contiguous: bool, 2025-05-07T20:32:54.7495245Z compiled: bool, 2025-05-07T20:32:54.7495713Z ) -> None: 2025-05-07T20:32:54.7496142Z torch.manual_seed(2025) 2025-05-07T20:32:54.7496623Z 2025-05-07T20:32:54.7497135Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.7501093Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
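The requested sizes are consistent with the input tensor alone: a [T, 2*D] bfloat16 tensor at T=16384, D=7168 is exactly the 448.00 MiB the allocator reports, and the 112.00 MiB and 40.00 MiB requests match T=4096, D=7168 and T=2048, D=5120 the same way:

# Sanity check of the 448.00 MiB figure reported above.
T, D = 16384, 7168
bytes_per_bf16 = 2
size_mib = T * (2 * D) * bytes_per_bf16 / 2**20
print(size_mib)  # 448.0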
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.7504927Z 2025-05-07T20:32:54.7505147Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.7505531Z 2025-05-07T20:32:54.7505721Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.7506463Z self=, 2025-05-07T20:32:54.7507266Z T=4096, 2025-05-07T20:32:54.7507621Z D=7168, 2025-05-07T20:32:54.7507985Z scale_ub=None, 2025-05-07T20:32:54.7508395Z contiguous=True, 2025-05-07T20:32:54.7508828Z compiled=False, 2025-05-07T20:32:54.7509221Z ) 2025-05-07T20:32:54.7509818Z self = 2025-05-07T20:32:54.7510770Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.7511303Z 2025-05-07T20:32:54.7511459Z @given( 2025-05-07T20:32:54.7511893Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.7512482Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.7513076Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.7513806Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.7514439Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.7515005Z ) 2025-05-07T20:32:54.7515689Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.7516562Z def test_silu_mul_quant( 2025-05-07T20:32:54.7517054Z self, 2025-05-07T20:32:54.7517436Z T: int, 2025-05-07T20:32:54.7517812Z D: int, 2025-05-07T20:32:54.7518229Z scale_ub: Optional[float], 2025-05-07T20:32:54.7518763Z contiguous: bool, 2025-05-07T20:32:54.7519220Z compiled: bool, 2025-05-07T20:32:54.7519645Z ) -> None: 2025-05-07T20:32:54.7520056Z torch.manual_seed(2025) 2025-05-07T20:32:54.7520521Z 2025-05-07T20:32:54.7521028Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.7525770Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.7529412Z 2025-05-07T20:32:54.7529647Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.7530050Z 2025-05-07T20:32:54.7530254Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.7531034Z self=, 2025-05-07T20:32:54.7531806Z T=16384, 2025-05-07T20:32:54.7532176Z D=7168, 2025-05-07T20:32:54.7532536Z scale_ub=None, 2025-05-07T20:32:54.7532937Z contiguous=True, 2025-05-07T20:32:54.7533509Z compiled=False, 2025-05-07T20:32:54.7533906Z ) 2025-05-07T20:32:54.7534496Z self = 2025-05-07T20:32:54.7535452Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.7535994Z 2025-05-07T20:32:54.7536153Z @given( 2025-05-07T20:32:54.7536580Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.7537321Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.7537907Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.7538529Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.7539160Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.7539710Z ) 2025-05-07T20:32:54.7540380Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.7541227Z def test_silu_mul_quant( 2025-05-07T20:32:54.7541700Z self, 2025-05-07T20:32:54.7542083Z T: int, 2025-05-07T20:32:54.7542452Z D: int, 2025-05-07T20:32:54.7542865Z scale_ub: Optional[float], 2025-05-07T20:32:54.7543381Z contiguous: bool, 2025-05-07T20:32:54.7543842Z compiled: bool, 2025-05-07T20:32:54.7544271Z ) -> None: 2025-05-07T20:32:54.7544675Z torch.manual_seed(2025) 2025-05-07T20:32:54.7545133Z 2025-05-07T20:32:54.7545650Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.7549690Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.7553317Z 2025-05-07T20:32:54.7553621Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.7554035Z 2025-05-07T20:32:54.7554253Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.7555032Z self=, 2025-05-07T20:32:54.7555801Z T=16384, 2025-05-07T20:32:54.7556184Z D=7168, 2025-05-07T20:32:54.7556540Z scale_ub=1200.0, 2025-05-07T20:32:54.7556989Z contiguous=True, 2025-05-07T20:32:54.7557444Z compiled=False, 2025-05-07T20:32:54.7557828Z ) 2025-05-07T20:32:54.7558425Z self = 2025-05-07T20:32:54.7559386Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.7559915Z 2025-05-07T20:32:54.7560075Z @given( 2025-05-07T20:32:54.7560499Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.7561111Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.7561703Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.7562330Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.7563113Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.7563679Z ) 2025-05-07T20:32:54.7564345Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.7565220Z def test_silu_mul_quant( 2025-05-07T20:32:54.7565683Z self, 2025-05-07T20:32:54.7566048Z T: int, 2025-05-07T20:32:54.7566426Z D: int, 2025-05-07T20:32:54.7566846Z scale_ub: Optional[float], 2025-05-07T20:32:54.7567356Z contiguous: bool, 2025-05-07T20:32:54.7567817Z compiled: bool, 2025-05-07T20:32:54.7568246Z ) -> None: 2025-05-07T20:32:54.7568660Z torch.manual_seed(2025) 2025-05-07T20:32:54.7569121Z 2025-05-07T20:32:54.7569646Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.7573731Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.7577399Z 2025-05-07T20:32:54.7577646Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.7578065Z 2025-05-07T20:32:54.7578264Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.7579051Z self=, 2025-05-07T20:32:54.7579829Z T=128, 2025-05-07T20:32:54.7580189Z D=5120, 2025-05-07T20:32:54.7580560Z scale_ub=1200.0, 2025-05-07T20:32:54.7580994Z contiguous=False, 2025-05-07T20:32:54.7581433Z compiled=False, 2025-05-07T20:32:54.7581821Z ) 2025-05-07T20:32:55.0994110Z self = 2025-05-07T20:32:55.0995048Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:55.0995521Z 2025-05-07T20:32:55.0995658Z @given( 2025-05-07T20:32:55.0996081Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.0996600Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.0997123Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.0997703Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.0998288Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.0998771Z ) 2025-05-07T20:32:55.0999293Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.0999960Z def test_silu_mul_quant( 2025-05-07T20:32:55.1000338Z self, 2025-05-07T20:32:55.1000633Z T: int, 2025-05-07T20:32:55.1000943Z D: int, 2025-05-07T20:32:55.1001287Z scale_ub: Optional[float], 2025-05-07T20:32:55.1001719Z contiguous: bool, 2025-05-07T20:32:55.1002111Z compiled: bool, 2025-05-07T20:32:55.1002487Z ) -> None: 2025-05-07T20:32:55.1002835Z torch.manual_seed(2025) 2025-05-07T20:32:55.1003221Z 2025-05-07T20:32:55.1003645Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.1004184Z 2025-05-07T20:32:55.1004489Z x_sign = torch.sign(x) 2025-05-07T20:32:55.1004980Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.1005483Z x = x_sign * x_clamp 2025-05-07T20:32:55.1005898Z x0 = x[:, :D] 2025-05-07T20:32:55.1006283Z x1 = x[:, D:] 2025-05-07T20:32:55.1006648Z 2025-05-07T20:32:55.1006980Z if contiguous: 2025-05-07T20:32:55.1007397Z x0 = x0.contiguous() 2025-05-07T20:32:55.1007865Z x1 = x1.contiguous() 2025-05-07T20:32:55.1008298Z 2025-05-07T20:32:55.1008642Z if scale_ub is not None: 2025-05-07T20:32:55.1009570Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.1010188Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.1010744Z ) 2025-05-07T20:32:55.1011091Z else: 2025-05-07T20:32:55.1011467Z scale_ub_tensor = None 2025-05-07T20:32:55.1011913Z 2025-05-07T20:32:55.1012325Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.1012877Z op = silu_mul_quant 2025-05-07T20:32:55.1013330Z if compiled: 2025-05-07T20:32:55.1013779Z op = torch.compile(op) 2025-05-07T20:32:55.1014283Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.1014733Z 2025-05-07T20:32:55.1015053Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.1015341Z 2025-05-07T20:32:55.1015668Z moe/activation_test.py:117: 2025-05-07T20:32:55.1016204Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.1016788Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.1017285Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.1018527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.1019894Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.1020868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.1022104Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.1023279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.1024686Z kernel = self.compile( 2025-05-07T20:32:55.1025695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.1026948Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.1027672Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.1028119Z 2025-05-07T20:32:55.1028488Z self = 2025-05-07T20:32:55.1030453Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.1032995Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf6948940>} 2025-05-07T20:32:55.1035579Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.1037539Z context = 2025-05-07T20:32:55.1038067Z 2025-05-07T20:32:55.1038364Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.1039323Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.1040152Z module_map=module_map) 2025-05-07T20:32:55.1040787Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.1041388Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.1041812Z E ^ 2025-05-07T20:32:55.1042579Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.1043426Z 2025-05-07T20:32:55.1044141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.1045023Z 2025-05-07T20:32:55.1045220Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.1046161Z self=, 2025-05-07T20:32:55.1046867Z T=2048, 2025-05-07T20:32:55.1047193Z D=7168, 2025-05-07T20:32:55.1047515Z scale_ub=None, 2025-05-07T20:32:55.1047892Z contiguous=False, 2025-05-07T20:32:55.1048287Z compiled=False, 2025-05-07T20:32:55.1048655Z ) 2025-05-07T20:32:55.1049198Z self = 2025-05-07T20:32:55.1050061Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:55.1050542Z 2025-05-07T20:32:55.1050687Z @given( 2025-05-07T20:32:55.1051079Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.1051643Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.1052176Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.1052859Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.1053431Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.1053938Z ) 2025-05-07T20:32:55.1054554Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.1055356Z def test_silu_mul_quant( 2025-05-07T20:32:55.1055781Z self, 2025-05-07T20:32:55.1056121Z T: int, 2025-05-07T20:32:55.1056617Z D: int, 2025-05-07T20:32:55.1056998Z scale_ub: Optional[float], 2025-05-07T20:32:55.1057469Z contiguous: bool, 2025-05-07T20:32:55.1057887Z compiled: bool, 2025-05-07T20:32:55.1058278Z ) -> None: 2025-05-07T20:32:55.1058656Z torch.manual_seed(2025) 2025-05-07T20:32:55.1059085Z 2025-05-07T20:32:55.1059543Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.1063290Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
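[Note] The fp8e4nv CompilationError above is Triton rejecting the float8 e4m3 element type at kernel-compile time: on NVIDIA targets fp8e4nv is only lowered for GPUs of compute capability 8.9 or newer, and this runner's GPU evidently predates that, since it only offers 'fp8e4b15' and 'fp8e5'. A hedged guard along these lines would skip the fp8 path instead of failing; the helper name is illustrative, not from the log:

# Sketch: skip fp8e4nv-dependent tests on GPUs that cannot compile the kernels.
# Assumption: fp8e4nv (float8 e4m3) requires compute capability >= (8, 9),
# which is consistent with the ValueError printed above.
import unittest
import torch

def supports_fp8e4nv() -> bool:  # illustrative helper
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipUnless(supports_fp8e4nv(), "GPU lacks fp8e4nv (e4m3) support")
class Fp8ActivationTests(unittest.TestCase):
    ...  # fp8 test cases would live here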
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.1066699Z 2025-05-07T20:32:55.1066949Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:55.1067342Z 2025-05-07T20:32:55.1067530Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.1068253Z self=, 2025-05-07T20:32:55.1068970Z T=128, 2025-05-07T20:32:55.1069296Z D=7168, 2025-05-07T20:32:55.1069624Z scale_ub=1200.0, 2025-05-07T20:32:55.1070007Z contiguous=True, 2025-05-07T20:32:55.1070401Z compiled=True, 2025-05-07T20:32:55.1070762Z ) 2025-05-07T20:32:55.1490857Z self = 2025-05-07T20:32:55.1491780Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:55.1492238Z 2025-05-07T20:32:55.1492387Z @given( 2025-05-07T20:32:55.1492744Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.1493227Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.1493729Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.1494305Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.1494881Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.1495380Z ) 2025-05-07T20:32:55.1496017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.1496750Z def test_silu_mul_quant( 2025-05-07T20:32:55.1497141Z self, 2025-05-07T20:32:55.1497450Z T: int, 2025-05-07T20:32:55.1497762Z D: int, 2025-05-07T20:32:55.1498106Z scale_ub: Optional[float], 2025-05-07T20:32:55.1498549Z contiguous: bool, 2025-05-07T20:32:55.1498953Z compiled: bool, 2025-05-07T20:32:55.1499622Z ) -> None: 2025-05-07T20:32:55.1499989Z torch.manual_seed(2025) 2025-05-07T20:32:55.1500391Z 2025-05-07T20:32:55.1500863Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.1501488Z 2025-05-07T20:32:55.1501793Z x_sign = torch.sign(x) 2025-05-07T20:32:55.1502224Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.1502692Z x = x_sign * x_clamp 2025-05-07T20:32:55.1503059Z x0 = x[:, :D] 2025-05-07T20:32:55.1503378Z x1 = x[:, D:] 2025-05-07T20:32:55.1503706Z 2025-05-07T20:32:55.1503994Z if contiguous: 2025-05-07T20:32:55.1504353Z x0 = x0.contiguous() 2025-05-07T20:32:55.1504772Z x1 = x1.contiguous() 2025-05-07T20:32:55.1505172Z 2025-05-07T20:32:55.1505605Z if scale_ub is not None: 2025-05-07T20:32:55.1506042Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.1506582Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.1507081Z ) 2025-05-07T20:32:55.1507381Z else: 2025-05-07T20:32:55.1507699Z scale_ub_tensor = None 2025-05-07T20:32:55.1508071Z 2025-05-07T20:32:55.1508420Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.1508997Z op = silu_mul_quant 2025-05-07T20:32:55.1509371Z if compiled: 2025-05-07T20:32:55.1509756Z op = torch.compile(op) 2025-05-07T20:32:55.1510204Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.1510629Z 2025-05-07T20:32:55.1510908Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.1511160Z 2025-05-07T20:32:55.1511309Z moe/activation_test.py:117: 2025-05-07T20:32:55.1511749Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.1512246Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.1512669Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.1513690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.1514599Z return fn(*args, **kwargs) 
2025-05-07T20:32:55.1515675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.1516812Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.1517675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.1518779Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.1519864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.1520734Z kernel = self.compile( 2025-05-07T20:32:55.1521602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.1522678Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.1523313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.1523677Z 2025-05-07T20:32:55.1524563Z self = 2025-05-07T20:32:55.1526348Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.1528669Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf6948dc0>} 2025-05-07T20:32:55.1530905Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.1532809Z context = 2025-05-07T20:32:55.1533292Z 2025-05-07T20:32:55.1533552Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.1534401Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.1535153Z module_map=module_map) 2025-05-07T20:32:55.1535711Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.1536247Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.1536644Z E ^ 2025-05-07T20:32:55.1537387Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.1538145Z 2025-05-07T20:32:55.1538848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.1539797Z 2025-05-07T20:32:55.1539965Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.1540606Z self=, 2025-05-07T20:32:55.1541262Z T=128, 2025-05-07T20:32:55.1541562Z D=7168, 2025-05-07T20:32:55.1541870Z scale_ub=1200.0, 2025-05-07T20:32:55.1542342Z contiguous=True, 2025-05-07T20:32:55.1542699Z compiled=False, 2025-05-07T20:32:55.1543004Z ) 2025-05-07T20:32:55.1543493Z self = 2025-05-07T20:32:55.1544258Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:55.1544690Z 2025-05-07T20:32:55.1544822Z @given( 2025-05-07T20:32:55.1545181Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.1545693Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.1546235Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.1546810Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.1547355Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.1547799Z ) 2025-05-07T20:32:55.1548337Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.1549069Z def test_silu_mul_quant( 2025-05-07T20:32:55.1549489Z self, 2025-05-07T20:32:55.1549810Z T: int, 2025-05-07T20:32:55.1550131Z D: int, 2025-05-07T20:32:55.1550496Z scale_ub: Optional[float], 2025-05-07T20:32:55.1550932Z contiguous: bool, 2025-05-07T20:32:55.1551290Z compiled: bool, 2025-05-07T20:32:55.1551633Z ) -> None: 2025-05-07T20:32:55.1551960Z torch.manual_seed(2025) 2025-05-07T20:32:55.1552337Z 2025-05-07T20:32:55.1552764Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.1553318Z 2025-05-07T20:32:55.1553704Z x_sign = torch.sign(x) 2025-05-07T20:32:55.1554179Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.1557506Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
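[Note] Note the pattern across the examples above: after the first large allocation fails, every later example fails on progressively smaller requests (here 20.00 MiB against 4.44 MiB free) because roughly 21.7 GiB is still held by PyTorch from earlier examples. One mitigation sketch is to release cached blocks between examples; wiring this into the test class's teardown is an assumption, not something the log shows:

# Sketch: return cached CUDA memory between tests so one OOMing example
# does not starve the rest of the run.
import gc
import torch

def release_cuda_memory() -> None:
    gc.collect()              # drop Python references to dead tensors first
    torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver

# e.g. call release_cuda_memory() from unittest.TestCase.tearDown()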
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.1560472Z 2025-05-07T20:32:55.1560657Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:55.1560979Z 2025-05-07T20:32:55.1561142Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.1561757Z self=, 2025-05-07T20:32:55.1562381Z T=128, 2025-05-07T20:32:55.1562662Z D=5120, 2025-05-07T20:32:55.1562942Z scale_ub=1200.0, 2025-05-07T20:32:55.1563275Z contiguous=True, 2025-05-07T20:32:55.1563720Z compiled=True, 2025-05-07T20:32:55.1564031Z ) 2025-05-07T20:32:55.1564511Z self = 2025-05-07T20:32:55.1565260Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:55.1565681Z 2025-05-07T20:32:55.1565807Z @given( 2025-05-07T20:32:55.1566147Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.1566621Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.1567090Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.1567585Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.1568099Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.1568540Z ) 2025-05-07T20:32:55.1569074Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.1569843Z def test_silu_mul_quant( 2025-05-07T20:32:55.1570216Z self, 2025-05-07T20:32:55.1570515Z T: int, 2025-05-07T20:32:55.1570821Z D: int, 2025-05-07T20:32:55.1571155Z scale_ub: Optional[float], 2025-05-07T20:32:55.1571561Z contiguous: bool, 2025-05-07T20:32:55.1571929Z compiled: bool, 2025-05-07T20:32:55.1572336Z ) -> None: 2025-05-07T20:32:55.1572657Z torch.manual_seed(2025) 2025-05-07T20:32:55.1573051Z 2025-05-07T20:32:55.1573463Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.1573995Z 2025-05-07T20:32:55.1574283Z x_sign = torch.sign(x) 2025-05-07T20:32:55.1574727Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.1578187Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.1581402Z 2025-05-07T20:32:55.1581645Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:55.1582052Z 2025-05-07T20:32:55.1582244Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.1582997Z self=, 2025-05-07T20:32:55.1593464Z T=128, 2025-05-07T20:32:55.1593829Z D=7168, 2025-05-07T20:32:55.1594127Z scale_ub=None, 2025-05-07T20:32:55.1594465Z contiguous=True, 2025-05-07T20:32:55.1594811Z compiled=True, 2025-05-07T20:32:55.1595119Z ) 2025-05-07T20:32:55.4002495Z self = 2025-05-07T20:32:55.4003412Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:55.4003853Z 2025-05-07T20:32:55.4004024Z @given( 2025-05-07T20:32:55.4004396Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4004879Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4005393Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4005934Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4006518Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4007010Z ) 2025-05-07T20:32:55.4007579Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4008303Z def test_silu_mul_quant( 2025-05-07T20:32:55.4008738Z self, 2025-05-07T20:32:55.4009063Z T: int, 2025-05-07T20:32:55.4009391Z D: int, 2025-05-07T20:32:55.4009756Z scale_ub: Optional[float], 2025-05-07T20:32:55.4010251Z contiguous: bool, 2025-05-07T20:32:55.4010675Z compiled: bool, 2025-05-07T20:32:55.4011074Z ) -> None: 2025-05-07T20:32:55.4011864Z torch.manual_seed(2025) 2025-05-07T20:32:55.4012305Z 2025-05-07T20:32:55.4012777Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4016564Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
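[Note] For scale: every parameter in the @given block is drawn from a small fixed list, so the full example space is 5 x 2 x 2 x 2 x 2 = 80 combinations, and @settings(verbosity=Verbosity.verbose, ...) is what makes Hypothesis print each "Trying example" block seen throughout this log. A quick check of that count:

# Sketch: size of the sampled parameter grid in test_silu_mul_quant.
from itertools import product

grid = list(product(
    [1, 128, 2048, 4096, 16384],  # T
    [5120, 7168],                 # D
    [None, 1200.00],              # scale_ub
    [True, False],                # contiguous
    [True, False],                # compiled
))
print(len(grid))  # 80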
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.4019949Z 2025-05-07T20:32:55.4020300Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:55.4020667Z 2025-05-07T20:32:55.4047048Z FAILED 2025-05-07T20:32:55.4047248Z 2025-05-07T20:32:55.4047460Z =================================== FAILURES =================================== 2025-05-07T20:32:55.4048024Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:55.4048543Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:55.4049409Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:55.4050071Z | yield 2025-05-07T20:32:55.4050602Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run 2025-05-07T20:32:55.4051213Z | self._callTestMethod(testMethod) 2025-05-07T20:32:55.4051854Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod 2025-05-07T20:32:55.4052521Z | method() 2025-05-07T20:32:55.4053251Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:55.4054077Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4054796Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:55.4055558Z | raise the_error_hypothesis_found 2025-05-07T20:32:55.4056130Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:55.4056706Z +-+---------------- 1 ---------------- 2025-05-07T20:32:55.4057066Z | Traceback (most recent call last): 2025-05-07T20:32:55.4057909Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:55.4058824Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4061219Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.4063509Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:55.4064022Z | self=, 2025-05-07T20:32:55.4064511Z | T=2048, 2025-05-07T20:32:55.4064795Z | D=5120, # or any other generated value 2025-05-07T20:32:55.4065187Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:55.4065622Z | contiguous=True, # or any other generated value 2025-05-07T20:32:55.4066065Z | compiled=False, # or any other generated value 2025-05-07T20:32:55.4066593Z | ) 2025-05-07T20:32:55.4066855Z | 2025-05-07T20:32:55.4067477Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:55.4068225Z +---------------- 2 ---------------- 2025-05-07T20:32:55.4068537Z | Traceback (most recent call last): 2025-05-07T20:32:55.4069387Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:55.4070303Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4072778Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.4076530Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:55.4077273Z | self=, 2025-05-07T20:32:55.4077983Z | T=128, 2025-05-07T20:32:55.4078319Z | D=7168, 2025-05-07T20:32:55.4078661Z | scale_ub=None, 2025-05-07T20:32:55.4079054Z | contiguous=True, 2025-05-07T20:32:55.4079438Z | compiled=True, 2025-05-07T20:32:55.4079790Z | ) 2025-05-07T20:32:55.4080079Z | 2025-05-07T20:32:55.4080771Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:55.4081431Z +---------------- 3 ---------------- 2025-05-07T20:32:55.4081746Z | Traceback (most recent call last): 2025-05-07T20:32:55.4082495Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:55.4084290Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4086440Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.4088507Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:55.4088975Z | self=, 2025-05-07T20:32:55.4089415Z | T=128, 2025-05-07T20:32:55.4089636Z | D=5120, 2025-05-07T20:32:55.4089866Z | scale_ub=1200.0, 2025-05-07T20:32:55.4090132Z | contiguous=True, 2025-05-07T20:32:55.4090391Z | compiled=True, 2025-05-07T20:32:55.4090631Z | ) 2025-05-07T20:32:55.4090840Z | 2025-05-07T20:32:55.4091526Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:55.4092173Z +---------------- 4 ---------------- 2025-05-07T20:32:55.4092479Z | Traceback (most recent call last): 2025-05-07T20:32:55.4093226Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:55.4093988Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:55.4094803Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:55.4095541Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.4096425Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:55.4097263Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:55.4097900Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:55.4098807Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4099941Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:55.4101070Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.4102253Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:55.4103496Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.4104649Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:55.4105676Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:55.4106673Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:55.4107508Z | fn() 2025-05-07T20:32:55.4108358Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:55.4109272Z | self.fn.run( 2025-05-07T20:32:55.4110047Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:55.4110926Z | kernel = self.compile( 2025-05-07T20:32:55.4111867Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:55.4112918Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4114094Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:55.4115252Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4116026Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4116552Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:55.4116945Z | ^ 2025-05-07T20:32:55.4117634Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4118476Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:55.4119077Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:55.4119836Z | self=, 2025-05-07T20:32:55.4120486Z | T=1, # or any other generated value 2025-05-07T20:32:55.4120958Z | D=5120, # or any other generated value 2025-05-07T20:32:55.4121448Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:55.4121990Z | contiguous=True, # or any other generated value 2025-05-07T20:32:55.4122535Z | compiled=True, # or any other generated value 2025-05-07T20:32:55.4123004Z | ) 2025-05-07T20:32:55.4123278Z | 2025-05-07T20:32:55.4124565Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:55.4125453Z +------------------------------------ 2025-05-07T20:32:55.4125964Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:55.4126518Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4127127Z self=, 2025-05-07T20:32:55.4127710Z T=1, 2025-05-07T20:32:55.4127990Z D=5120, 2025-05-07T20:32:55.4128293Z scale_ub=None, 2025-05-07T20:32:55.4128612Z contiguous=True, 2025-05-07T20:32:55.4128949Z compiled=True, 2025-05-07T20:32:55.4129273Z ) 2025-05-07T20:32:55.4129741Z self = 2025-05-07T20:32:55.4130560Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:55.4130934Z 2025-05-07T20:32:55.4131050Z @given( 2025-05-07T20:32:55.4131395Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4131857Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4132330Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4132812Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4133373Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4133802Z ) 2025-05-07T20:32:55.4134329Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4134992Z def test_silu_mul_quant( 2025-05-07T20:32:55.4135347Z self, 2025-05-07T20:32:55.4135639Z T: int, 2025-05-07T20:32:55.4135937Z D: int, 2025-05-07T20:32:55.4136267Z scale_ub: Optional[float], 2025-05-07T20:32:55.4136695Z contiguous: bool, 2025-05-07T20:32:55.4137057Z compiled: bool, 2025-05-07T20:32:55.4137385Z ) -> None: 2025-05-07T20:32:55.4137709Z torch.manual_seed(2025) 2025-05-07T20:32:55.4138068Z 2025-05-07T20:32:55.4138466Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4138981Z 2025-05-07T20:32:55.4139277Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4139705Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4140179Z x = x_sign * x_clamp 2025-05-07T20:32:55.4140553Z x0 = x[:, :D] 2025-05-07T20:32:55.4140885Z x1 = x[:, D:] 2025-05-07T20:32:55.4141210Z 2025-05-07T20:32:55.4141499Z if contiguous: 2025-05-07T20:32:55.4141848Z x0 = x0.contiguous() 
2025-05-07T20:32:55.4142249Z x1 = x1.contiguous() 2025-05-07T20:32:55.4142622Z 2025-05-07T20:32:55.4142911Z if scale_ub is not None: 2025-05-07T20:32:55.4143300Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4143783Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4144223Z ) 2025-05-07T20:32:55.4144504Z else: 2025-05-07T20:32:55.4144816Z scale_ub_tensor = None 2025-05-07T20:32:55.4145175Z 2025-05-07T20:32:55.4145497Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4145938Z op = silu_mul_quant 2025-05-07T20:32:55.4146340Z if compiled: 2025-05-07T20:32:55.4146696Z op = torch.compile(op) 2025-05-07T20:32:55.4147114Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4147512Z 2025-05-07T20:32:55.4147782Z y_fp8, y_scale = fn() 2025-05-07T20:32:55.4148191Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:55.4148603Z 2025-05-07T20:32:55.4148942Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4149420Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:55.4149862Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:55.4150345Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:55.4150873Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.4151432Z 2025-05-07T20:32:55.4151729Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:55.4152008Z 2025-05-07T20:32:55.4152154Z moe/activation_test.py:126: 2025-05-07T20:32:55.4152577Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4153060Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:55.4153642Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.4154780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:55.4155884Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:55.4156648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4157648Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4158654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:55.4159688Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.4160731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:55.4161820Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.4162832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:55.4163729Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:55.4164567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:55.4165289Z fn() 2025-05-07T20:32:55.4165998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:55.4166818Z self.fn.run( 2025-05-07T20:32:55.4167469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4168221Z kernel = self.compile( 2025-05-07T20:32:55.4168978Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4169907Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4170480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4170810Z 2025-05-07T20:32:55.4171102Z self = 2025-05-07T20:32:55.4172676Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4174637Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff1bc73400>} 2025-05-07T20:32:55.4176508Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4177994Z context = 2025-05-07T20:32:55.4178420Z 2025-05-07T20:32:55.4178662Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4179420Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4180099Z module_map=module_map) 2025-05-07T20:32:55.4180627Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4181144Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:55.4181629Z E ^ 2025-05-07T20:32:55.4182303Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4182964Z 2025-05-07T20:32:55.4183568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4184279Z 2025-05-07T20:32:55.4184440Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4185021Z self=, 2025-05-07T20:32:55.4185583Z T=2048, 2025-05-07T20:32:55.4185859Z D=5120, 2025-05-07T20:32:55.4186132Z scale_ub=1200.0, 2025-05-07T20:32:55.4186458Z contiguous=True, 2025-05-07T20:32:55.4186781Z compiled=False, 2025-05-07T20:32:55.4187127Z ) 2025-05-07T20:32:55.4187577Z self = 2025-05-07T20:32:55.4188281Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:55.4188678Z 2025-05-07T20:32:55.4188803Z @given( 2025-05-07T20:32:55.4189147Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4189624Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4190146Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4190638Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4191146Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4191581Z ) 2025-05-07T20:32:55.4192103Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4192759Z def test_silu_mul_quant( 2025-05-07T20:32:55.4193108Z self, 2025-05-07T20:32:55.4193387Z T: int, 2025-05-07T20:32:55.4193762Z D: int, 2025-05-07T20:32:55.4194086Z scale_ub: Optional[float], 2025-05-07T20:32:55.4194471Z contiguous: bool, 2025-05-07T20:32:55.4194803Z compiled: bool, 2025-05-07T20:32:55.4195122Z ) -> None: 2025-05-07T20:32:55.4195444Z torch.manual_seed(2025) 2025-05-07T20:32:55.4195787Z 2025-05-07T20:32:55.4196179Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4196664Z 2025-05-07T20:32:55.4196945Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4197355Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4197795Z x = x_sign * x_clamp 2025-05-07T20:32:55.4198137Z x0 = x[:, :D] 
2025-05-07T20:32:55.4198455Z x1 = x[:, D:] 2025-05-07T20:32:55.4198754Z 2025-05-07T20:32:55.4199026Z if contiguous: 2025-05-07T20:32:55.4199368Z x0 = x0.contiguous() 2025-05-07T20:32:55.4199757Z x1 = x1.contiguous() 2025-05-07T20:32:55.4200112Z 2025-05-07T20:32:55.4200391Z if scale_ub is not None: 2025-05-07T20:32:55.4200787Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4201272Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4201730Z ) 2025-05-07T20:32:55.4202025Z else: 2025-05-07T20:32:55.4202338Z scale_ub_tensor = None 2025-05-07T20:32:55.4202707Z 2025-05-07T20:32:55.4203041Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4203504Z op = silu_mul_quant 2025-05-07T20:32:55.4203864Z if compiled: 2025-05-07T20:32:55.4204234Z op = torch.compile(op) 2025-05-07T20:32:55.4204682Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4205074Z 2025-05-07T20:32:55.4205358Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4205595Z 2025-05-07T20:32:55.4205747Z moe/activation_test.py:117: 2025-05-07T20:32:55.4206178Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4206652Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4207055Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4208085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4209030Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4209786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4210756Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4211697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4212458Z kernel = self.compile( 2025-05-07T20:32:55.4213233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4214188Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4214824Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4215173Z 2025-05-07T20:32:55.4215487Z self = 2025-05-07T20:32:55.4217093Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4219196Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff1b452ef0>} 2025-05-07T20:32:55.4221130Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4222600Z context = 2025-05-07T20:32:55.4223031Z 2025-05-07T20:32:55.4223278Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4224300Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4225000Z module_map=module_map) 2025-05-07T20:32:55.4225527Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4244273Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4244728Z E ^ 2025-05-07T20:32:55.4245430Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4246113Z 2025-05-07T20:32:55.4246728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4247507Z 2025-05-07T20:32:55.4247667Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4248283Z self=, 2025-05-07T20:32:55.4248865Z T=2048, 2025-05-07T20:32:55.4249140Z D=5120, 2025-05-07T20:32:55.4249432Z scale_ub=1200.0, 2025-05-07T20:32:55.4249781Z contiguous=True, 2025-05-07T20:32:55.4250120Z compiled=True, 2025-05-07T20:32:55.4250436Z ) 2025-05-07T20:32:55.4250922Z self = 2025-05-07T20:32:55.4251667Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:55.4252083Z 2025-05-07T20:32:55.4252202Z @given( 2025-05-07T20:32:55.4252556Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4253031Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4253499Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4254003Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4254512Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4254939Z ) 2025-05-07T20:32:55.4255438Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4256079Z def test_silu_mul_quant( 2025-05-07T20:32:55.4256723Z self, 2025-05-07T20:32:55.4257031Z T: int, 2025-05-07T20:32:55.4257330Z D: int, 2025-05-07T20:32:55.4257649Z scale_ub: Optional[float], 2025-05-07T20:32:55.4258061Z contiguous: bool, 2025-05-07T20:32:55.4258426Z compiled: bool, 2025-05-07T20:32:55.4258755Z ) -> None: 2025-05-07T20:32:55.4259080Z torch.manual_seed(2025) 2025-05-07T20:32:55.4259452Z 2025-05-07T20:32:55.4259825Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4260313Z 2025-05-07T20:32:55.4260589Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4260997Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4261432Z x = x_sign * x_clamp 2025-05-07T20:32:55.4261782Z x0 = x[:, :D] 2025-05-07T20:32:55.4262211Z x1 = x[:, D:] 2025-05-07T20:32:55.4262528Z 2025-05-07T20:32:55.4262810Z if contiguous: 2025-05-07T20:32:55.4263151Z x0 = x0.contiguous() 2025-05-07T20:32:55.4263535Z x1 = x1.contiguous() 2025-05-07T20:32:55.4263899Z 2025-05-07T20:32:55.4264187Z if scale_ub is not None: 2025-05-07T20:32:55.4264590Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4265187Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4265636Z ) 2025-05-07T20:32:55.4265925Z else: 2025-05-07T20:32:55.4266247Z scale_ub_tensor = None 2025-05-07T20:32:55.4266631Z 2025-05-07T20:32:55.4266972Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4267450Z op = silu_mul_quant 2025-05-07T20:32:55.4267832Z if compiled: 2025-05-07T20:32:55.4268208Z op = torch.compile(op) 2025-05-07T20:32:55.4268652Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4269073Z 2025-05-07T20:32:55.4269368Z y_fp8, y_scale = fn() 2025-05-07T20:32:55.4269800Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:55.4270239Z 2025-05-07T20:32:55.4270526Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4270884Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:55.4271207Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:55.4271545Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:55.4271924Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.4272263Z 2025-05-07T20:32:55.4272481Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:55.4272689Z 2025-05-07T20:32:55.4272801Z moe/activation_test.py:126: 2025-05-07T20:32:55.4273116Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4273477Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:55.4273967Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.4274802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:55.4275599Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:55.4276178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4276902Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4277623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:55.4278391Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.4279192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:55.4279988Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.4280859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:55.4281542Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:55.4282177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:55.4282721Z fn() 2025-05-07T20:32:55.4283259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:55.4283873Z self.fn.run( 2025-05-07T20:32:55.4284371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4284928Z kernel = self.compile( 2025-05-07T20:32:55.4285500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4286234Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4286653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4286902Z 2025-05-07T20:32:55.4287123Z self = 2025-05-07T20:32:55.4288271Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4289776Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7eff09f211b0>} 2025-05-07T20:32:55.4291191Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4292272Z context = 2025-05-07T20:32:55.4292584Z 2025-05-07T20:32:55.4292767Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4293325Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4293828Z module_map=module_map) 2025-05-07T20:32:55.4294212Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4294593Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:55.4294878Z E ^ 2025-05-07T20:32:55.4295367Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4295848Z 2025-05-07T20:32:55.4296289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4296835Z 2025-05-07T20:32:55.4296947Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4297388Z self=, 2025-05-07T20:32:55.4297814Z T=16384, 2025-05-07T20:32:55.4298024Z D=7168, 2025-05-07T20:32:55.4298235Z scale_ub=1200.0, 2025-05-07T20:32:55.4298474Z contiguous=False, 2025-05-07T20:32:55.4298717Z compiled=False, 2025-05-07T20:32:55.4298940Z ) 2025-05-07T20:32:55.4299276Z self = 2025-05-07T20:32:55.4299813Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:55.4300109Z 2025-05-07T20:32:55.4300198Z @given( 2025-05-07T20:32:55.4300449Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4300778Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4301105Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4301457Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4301811Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4302118Z ) 2025-05-07T20:32:55.4302579Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4303048Z def test_silu_mul_quant( 2025-05-07T20:32:55.4303310Z self, 2025-05-07T20:32:55.4303523Z T: int, 2025-05-07T20:32:55.4303732Z D: int, 2025-05-07T20:32:55.4303969Z scale_ub: Optional[float], 2025-05-07T20:32:55.4304259Z contiguous: bool, 2025-05-07T20:32:55.4304512Z compiled: bool, 2025-05-07T20:32:55.4304754Z ) -> None: 2025-05-07T20:32:55.4304984Z torch.manual_seed(2025) 2025-05-07T20:32:55.4305237Z 2025-05-07T20:32:55.4305529Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4305895Z 2025-05-07T20:32:55.4306106Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4306414Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4306793Z x = x_sign * x_clamp 2025-05-07T20:32:55.4307054Z x0 = x[:, :D] 2025-05-07T20:32:55.4307281Z x1 = x[:, D:] 2025-05-07T20:32:55.4307512Z 2025-05-07T20:32:55.4307714Z if contiguous: 2025-05-07T20:32:55.4307959Z x0 = x0.contiguous() 2025-05-07T20:32:55.4308235Z x1 = x1.contiguous() 2025-05-07T20:32:55.4308495Z 2025-05-07T20:32:55.4308743Z if scale_ub is not None: 2025-05-07T20:32:55.4309040Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4309398Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4309725Z ) 2025-05-07T20:32:55.4309935Z else: 2025-05-07T20:32:55.4310164Z scale_ub_tensor = None 2025-05-07T20:32:55.4310429Z 2025-05-07T20:32:55.4310681Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4311018Z op = silu_mul_quant 2025-05-07T20:32:55.4311290Z if compiled: 
2025-05-07T20:32:55.4311553Z op = torch.compile(op) 2025-05-07T20:32:55.4311871Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4312171Z 2025-05-07T20:32:55.4312380Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4312562Z 2025-05-07T20:32:55.4312668Z moe/activation_test.py:117: 2025-05-07T20:32:55.4312989Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4313344Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4313744Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4314483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4315217Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4315781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4316507Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4317213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4317779Z kernel = self.compile( 2025-05-07T20:32:55.4318355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4319051Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4319477Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4319726Z 2025-05-07T20:32:55.4319946Z self = 2025-05-07T20:32:55.4321092Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4322556Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff09f20af0>} 2025-05-07T20:32:55.4324467Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4325570Z context = 2025-05-07T20:32:55.4325878Z 2025-05-07T20:32:55.4326056Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4326661Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4327160Z module_map=module_map) 2025-05-07T20:32:55.4327543Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4327919Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4328345Z E ^ 2025-05-07T20:32:55.4328836Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:55.4329314Z 
2025-05-07T20:32:55.4329759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:55.4330307Z 
Hypothesis, running with verbosity=Verbosity.verbose, re-prints the full test source and a near-identical Triton traceback for every sampled example, and every example dies at the same point: ast_to_ttir rejecting the fp8e4nv cast during src.make_ir. The intermediate examples are therefore condensed to their sampled parameters and failing frame; the full source listing and traceback are retained once, for the final example (T=16384) at the end of this section.

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
  fn() returned; ref_fn() raised at moe/activation_test.py:126 -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid]: triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
  fn() raised at moe/activation_test.py:117 -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant[grid]: same CompilationError.
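Every failure above and below has the same root cause: both _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row cast to Triton's fp8e4nv (FP8 E4M3), and Triton only lowers that dtype on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). The A10G on this g5 runner reports capability (8, 6), where Triton offers only fp8e4b15 and fp8e5, hence the ValueError at IR-construction time. A minimal sketch of the capability check; the (8, 9) threshold is our reading of Triton's NVIDIA backend support, not an FBGEMM API:

import torch

def cuda_supports_fp8e4nv() -> bool:
    # fp8e4nv (E4M3) needs SM 8.9+; on SM 8.6 (A10G) Triton exposes only
    # fp8e4b15 and fp8e5, which matches the ValueError in this log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)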
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
  fn() raised at moe/activation_test.py:117 -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant[grid]: same CompilationError.

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
  fn() returned; ref_fn() raised at moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]: same CompilationError.
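Rather than letting Hypothesis re-raise the identical error for each sampled example, the test could be skipped up front on unsupported hardware. A sketch using pytest; the marker name and placement are hypothetical, not the suite's actual gating:

import pytest
import torch

# Skip FP8 E4M3 tests on GPUs where Triton cannot lower fp8e4nv.
requires_fp8e4nv = pytest.mark.skipif(
    not torch.cuda.is_available()
    or torch.cuda.get_device_capability() < (8, 9),
    reason="Triton fp8e4nv (FP8 E4M3) requires SM 8.9+ (Ada/Hopper)",
)

Applied as @requires_fp8e4nv on test_silu_mul_quant, the property test would never reach the Triton compile path on this runner, and the job would report a skip instead of dozens of identical failures.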
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
  fn() raised at moe/activation_test.py:117 -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant[grid]: same CompilationError.

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  fn() raised at moe/activation_test.py:117 -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant[grid]: same CompilationError. Passing a scale_ub tensor makes no difference; the kernel fails at compile time, before launch.
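For context on what is being tested: ref_fn computes the SiLU-gated product y = x0 * sigmoid(x0) * x1 in fp32, then row-quantizes it with triton_quantize_fp8_row. A rough PyTorch-only stand-in for that quantization, assuming row-wise max-abs scaling into E4M3 range (max magnitude 448) and the dequantization convention the test itself uses, y_fp8.to(torch.float32) * y_scale[:, None]; the clamping constants are illustrative, not FBGEMM's exact kernel behavior:

from typing import Optional, Tuple
import torch

def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    FP8_MAX = 448.0  # max magnitude representable in float8_e4m3fn
    row_max = y.abs().amax(dim=-1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
    scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale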
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
  fn() returned; ref_fn() raised at moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]: same CompilationError.

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
  fn() returned; ref_fn() raised at moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]: same CompilationError.
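The failure is independent of the FBGEMM kernels themselves: on this GPU, any Triton kernel that casts to tl.float8e4nv should hit the same error during compilation. A hypothetical standalone reproducer (kernel name and launch config are ours, not from the log):

import torch
import triton
import triton.language as tl

@triton.jit
def _fp8_cast_probe(x_ptr, y_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    x = tl.load(x_ptr + offs)
    # This cast is what ast_to_ttir rejects on SM < 8.9:
    y = x.to(tl.float8e4nv).to(tl.float32)
    tl.store(y_ptr + offs, y)

x = torch.randn(16, device="cuda")
y = torch.empty_like(x)
# Expected on an A10G: triton.compiler.errors.CompilationError wrapping
# ValueError("type fp8e4nv not supported in this architecture. ...")
_fp8_cast_probe[(1,)](x, y, BLOCK=16)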
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
  fn() returned; ref_fn() raised at moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]: same CompilationError.

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
  fn() returned; ref_fn() raised at moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]: same CompilationError.
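Alternatively, a kernel can pick a supported fp8 format at runtime. Purely illustrative: E5M2 carries fewer mantissa bits than E4M3, so this changes numerics and is not a drop-in fix for these tests:

import torch
import triton.language as tl

def pick_fp8_dtype():
    # Prefer E4M3 (fp8e4nv) where supported; fall back to E5M2 (fp8e5),
    # one of the two dtypes the error message lists for this architecture.
    if torch.cuda.get_device_capability() >= (8, 9):
        return tl.float8e4nv
    return tl.float8e5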
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4632893Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7eff0888f5b0>} 2025-05-07T20:32:55.4633738Z module_map = {'triton.language.extra.libdevice': <module ...>} 2025-05-07T20:32:55.4633945Z context = <...> 2025-05-07T20:32:55.4633950Z 2025-05-07T20:32:55.4634129Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4634404Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4634525Z module_map=module_map) 2025-05-07T20:32:55.4634697Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4634806Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:55.4634896Z E ^ 2025-05-07T20:32:55.4635269Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4635278Z 2025-05-07T20:32:55.4635712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
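The failure is environmental rather than a logic bug: both kernels in play (_kernel_quantize_fp8_row on the reference path and _fbgemm_silu_mul_quant in the op under test) cast to fp8e4nv, Triton's name for torch.float8_e4m3fn, and on this Triton build that dtype appears to require an NVIDIA GPU of compute capability 8.9 or newer. The linux.g5.4xlarge runner carries an A10G (SM 8.6), which is why compilation stops with the ValueError above before any assertion runs. A minimal sketch of a capability guard that would skip these tests on such GPUs (the helper name and the test-class wiring are illustrative, not FBGEMM's actual code):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv corresponds to torch.float8_e4m3fn; Triton compiles it only
        # for sufficiently new NVIDIA GPUs (assumed here: SM >= 8.9, i.e. Ada
        # or Hopper). The A10G on this runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical wiring; the real test class lives in moe/activation_test.py.
    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class ActivationTests(unittest.TestCase):
        pass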
2025-05-07T20:32:55.4640900Z ) 2025-05-07T20:32:55.4640987Z else: 2025-05-07T20:32:55.4641088Z scale_ub_tensor = None 2025-05-07T20:32:55.4641166Z 2025-05-07T20:32:55.4641317Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4641413Z op = silu_mul_quant 2025-05-07T20:32:55.4641505Z if compiled: 2025-05-07T20:32:55.4641618Z op = torch.compile(op) 2025-05-07T20:32:55.4641735Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4641820Z 2025-05-07T20:32:55.4641917Z y_fp8, y_scale = fn() 2025-05-07T20:32:55.4642046Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:55.4642134Z 2025-05-07T20:32:55.4642277Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4642384Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:55.4642507Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:55.4642636Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:55.4642786Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.4642872Z 2025-05-07T20:32:55.4642987Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:55.4642992Z 2025-05-07T20:32:55.4643106Z moe/activation_test.py:126: 2025-05-07T20:32:55.4643240Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4643356Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:55.4643506Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.4644091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:55.4644200Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:55.4644586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4644821Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4645218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:55.4645575Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.4645998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:55.4646283Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.4646673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:55.4646856Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:55.4647214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:55.4647298Z fn() 2025-05-07T20:32:55.4647762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:55.4647850Z self.fn.run( 2025-05-07T20:32:55.4648213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4648319Z kernel = self.compile( 2025-05-07T20:32:55.4648717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4648949Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4649085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:55.4649089Z 2025-05-07T20:32:55.4649306Z self = 2025-05-07T20:32:55.4650126Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4650662Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff0949ec20>} 2025-05-07T20:32:55.4651446Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4651653Z context = 2025-05-07T20:32:55.4651658Z 2025-05-07T20:32:55.4651834Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4652116Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4652233Z module_map=module_map) 2025-05-07T20:32:55.4652409Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4652521Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:55.4652603Z E ^ 2025-05-07T20:32:55.4652984Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4652988Z 2025-05-07T20:32:55.4653421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4653428Z 2025-05-07T20:32:55.4653546Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4653779Z self=, 2025-05-07T20:32:55.4653861Z T=1, 2025-05-07T20:32:55.4653948Z D=5120, 2025-05-07T20:32:55.4654038Z scale_ub=1200.0, 2025-05-07T20:32:55.4654128Z contiguous=True, 2025-05-07T20:32:55.4654222Z compiled=True, 2025-05-07T20:32:55.4654301Z ) 2025-05-07T20:32:55.4654528Z self = 2025-05-07T20:32:55.4654711Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:55.4654716Z 2025-05-07T20:32:55.4654800Z @given( 2025-05-07T20:32:55.4655016Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4655126Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4655250Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4655384Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4655506Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4655585Z ) 2025-05-07T20:32:55.4655851Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4655952Z def test_silu_mul_quant( 2025-05-07T20:32:55.4656034Z self, 2025-05-07T20:32:55.4656126Z T: int, 2025-05-07T20:32:55.4656207Z D: int, 2025-05-07T20:32:55.4656317Z scale_ub: Optional[float], 2025-05-07T20:32:55.4656455Z contiguous: bool, 2025-05-07T20:32:55.4656548Z compiled: bool, 2025-05-07T20:32:55.4656638Z ) -> None: 2025-05-07T20:32:55.4656739Z torch.manual_seed(2025) 2025-05-07T20:32:55.4656823Z 2025-05-07T20:32:55.4657006Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4657085Z 2025-05-07T20:32:55.4657185Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4657368Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4657461Z x = x_sign * x_clamp 2025-05-07T20:32:55.4657547Z x0 = x[:, :D] 2025-05-07T20:32:55.4657638Z x1 = x[:, D:] 2025-05-07T20:32:55.4657718Z 2025-05-07T20:32:55.4657813Z if contiguous: 2025-05-07T20:32:55.4657917Z x0 = x0.contiguous() 2025-05-07T20:32:55.4658011Z x1 = x1.contiguous() 2025-05-07T20:32:55.4658096Z 2025-05-07T20:32:55.4658191Z if scale_ub is not None: 2025-05-07T20:32:55.4658304Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:32:55.4658455Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4658536Z ) 2025-05-07T20:32:55.4658617Z else: 2025-05-07T20:32:55.4658730Z scale_ub_tensor = None 2025-05-07T20:32:55.4658808Z 2025-05-07T20:32:55.4658947Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4659051Z op = silu_mul_quant 2025-05-07T20:32:55.4659145Z if compiled: 2025-05-07T20:32:55.4659251Z op = torch.compile(op) 2025-05-07T20:32:55.4659370Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4659448Z 2025-05-07T20:32:55.4659552Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4659556Z 2025-05-07T20:32:55.4659665Z moe/activation_test.py:117: 2025-05-07T20:32:55.4659803Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4659916Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4660025Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4660413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.4660524Z return fn(*args, **kwargs) 2025-05-07T20:32:55.4661042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4661156Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4661532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4661769Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4662134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4662236Z kernel = self.compile( 2025-05-07T20:32:55.4662635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4662830Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4663047Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4663052Z 2025-05-07T20:32:55.4663277Z self = 2025-05-07T20:32:55.4664082Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4664620Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff08d2ac20>} 2025-05-07T20:32:55.4665396Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4665641Z context = 2025-05-07T20:32:55.4665645Z 2025-05-07T20:32:55.4665835Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4666114Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4666273Z module_map=module_map) 2025-05-07T20:32:55.4666448Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4666554Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4666643Z E ^ 2025-05-07T20:32:55.4667015Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4667020Z 2025-05-07T20:32:55.4667451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4667466Z 2025-05-07T20:32:55.4667576Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4667812Z self=, 2025-05-07T20:32:55.4667907Z T=1, 2025-05-07T20:32:55.4667990Z D=5120, 2025-05-07T20:32:55.4668080Z scale_ub=None, 2025-05-07T20:32:55.4668178Z contiguous=False, 2025-05-07T20:32:55.4668266Z compiled=True, 2025-05-07T20:32:55.4668348Z ) 2025-05-07T20:32:55.4668584Z self = 2025-05-07T20:32:55.4668759Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:55.4668764Z 2025-05-07T20:32:55.4668850Z @given( 2025-05-07T20:32:55.4668985Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4669091Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4669219Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4669343Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4669471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4669558Z ) 2025-05-07T20:32:55.4669820Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4669920Z def test_silu_mul_quant( 2025-05-07T20:32:55.4670008Z self, 2025-05-07T20:32:55.4670092Z T: int, 2025-05-07T20:32:55.4670177Z D: int, 2025-05-07T20:32:55.4670289Z scale_ub: Optional[float], 2025-05-07T20:32:55.4670384Z contiguous: bool, 2025-05-07T20:32:55.4670483Z compiled: bool, 2025-05-07T20:32:55.4670567Z ) -> None: 2025-05-07T20:32:55.4670670Z torch.manual_seed(2025) 2025-05-07T20:32:55.4670755Z 2025-05-07T20:32:55.4670934Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4671012Z 2025-05-07T20:32:55.4671117Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4671249Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4671346Z x = x_sign * x_clamp 2025-05-07T20:32:55.4671443Z x0 = x[:, :D] 2025-05-07T20:32:55.4671528Z x1 = x[:, D:] 2025-05-07T20:32:55.4671606Z 2025-05-07T20:32:55.4671812Z if contiguous: 2025-05-07T20:32:55.4671912Z x0 = x0.contiguous() 2025-05-07T20:32:55.4672009Z x1 = x1.contiguous() 2025-05-07T20:32:55.4672095Z 2025-05-07T20:32:55.4672196Z if scale_ub is not None: 2025-05-07T20:32:55.4672316Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4672458Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4672539Z ) 2025-05-07T20:32:55.4672630Z else: 2025-05-07T20:32:55.4672731Z scale_ub_tensor = None 2025-05-07T20:32:55.4672810Z 2025-05-07T20:32:55.4672952Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4673048Z op = silu_mul_quant 2025-05-07T20:32:55.4673138Z if compiled: 2025-05-07T20:32:55.4673293Z op = torch.compile(op) 2025-05-07T20:32:55.4673406Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4673487Z 2025-05-07T20:32:55.4673726Z y_fp8, y_scale = fn() 2025-05-07T20:32:55.4673856Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:55.4673939Z 2025-05-07T20:32:55.4674084Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4674825Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:55.4674938Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:55.4675069Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:55.4675219Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.4675301Z 2025-05-07T20:32:55.4675408Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:55.4675413Z 2025-05-07T20:32:55.4675521Z moe/activation_test.py:126: 2025-05-07T20:32:55.4675666Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4675782Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:55.4675931Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.4676524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:55.4676635Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:55.4677023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4677263Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4677653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:55.4677920Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.4678338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:55.4678619Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.4679012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:55.4679190Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:55.4679558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:55.4679642Z fn() 2025-05-07T20:32:55.4680068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:55.4680157Z self.fn.run( 2025-05-07T20:32:55.4680510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4680617Z kernel = self.compile( 2025-05-07T20:32:55.4681018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4681288Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4681435Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4681440Z 2025-05-07T20:32:55.4681656Z self = 2025-05-07T20:32:55.4682478Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4683011Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7eff0837f370>} 2025-05-07T20:32:55.4683794Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4684045Z context = 2025-05-07T20:32:55.4684049Z 2025-05-07T20:32:55.4684228Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4684557Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4684673Z module_map=module_map) 2025-05-07T20:32:55.4684855Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4684967Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:55.4685049Z E ^ 2025-05-07T20:32:55.4685430Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4685435Z 2025-05-07T20:32:55.4685871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4685875Z 2025-05-07T20:32:55.4685996Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4686231Z self=, 2025-05-07T20:32:55.4686313Z T=1, 2025-05-07T20:32:55.4686401Z D=5120, 2025-05-07T20:32:55.4686491Z scale_ub=None, 2025-05-07T20:32:55.4686585Z contiguous=True, 2025-05-07T20:32:55.4686681Z compiled=False, 2025-05-07T20:32:55.4686762Z ) 2025-05-07T20:32:55.4686990Z self = 2025-05-07T20:32:55.4687170Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:55.4687175Z 2025-05-07T20:32:55.4687257Z @given( 2025-05-07T20:32:55.4687382Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4687497Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4687624Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4687754Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4687878Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4687957Z ) 2025-05-07T20:32:55.4688222Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4688321Z def test_silu_mul_quant( 2025-05-07T20:32:55.4688406Z self, 2025-05-07T20:32:55.4688494Z T: int, 2025-05-07T20:32:55.4688577Z D: int, 2025-05-07T20:32:55.4688682Z scale_ub: Optional[float], 2025-05-07T20:32:55.4688784Z contiguous: bool, 2025-05-07T20:32:55.4688877Z compiled: bool, 2025-05-07T20:32:55.4688961Z ) -> None: 2025-05-07T20:32:55.4689067Z torch.manual_seed(2025) 2025-05-07T20:32:55.4689148Z 2025-05-07T20:32:55.4689332Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4689411Z 2025-05-07T20:32:55.4689513Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4689654Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4689748Z x = x_sign * x_clamp 2025-05-07T20:32:55.4689917Z x0 = x[:, :D] 2025-05-07T20:32:55.4690011Z x1 = x[:, D:] 2025-05-07T20:32:55.4690089Z 2025-05-07T20:32:55.4690180Z if contiguous: 2025-05-07T20:32:55.4690287Z x0 = x0.contiguous() 2025-05-07T20:32:55.4690384Z x1 = x1.contiguous() 2025-05-07T20:32:55.4690460Z 2025-05-07T20:32:55.4690562Z if scale_ub is not None: 2025-05-07T20:32:55.4690675Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4690825Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4690906Z ) 2025-05-07T20:32:55.4690986Z else: 2025-05-07T20:32:55.4691094Z scale_ub_tensor = None 2025-05-07T20:32:55.4691172Z 2025-05-07T20:32:55.4691309Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4691476Z op = silu_mul_quant 2025-05-07T20:32:55.4691566Z if compiled: 2025-05-07T20:32:55.4691672Z 
op = torch.compile(op) 2025-05-07T20:32:55.4691796Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4691877Z 2025-05-07T20:32:55.4691973Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4691985Z 2025-05-07T20:32:55.4692088Z moe/activation_test.py:117: 2025-05-07T20:32:55.4692268Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4692382Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4692489Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4693014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4693129Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4693505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4693745Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4694114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4694215Z kernel = self.compile( 2025-05-07T20:32:55.4694624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4694814Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4694947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4694952Z 2025-05-07T20:32:55.4695178Z self = 2025-05-07T20:32:55.4695985Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4696527Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff0837feb0>} 2025-05-07T20:32:55.4697303Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4697515Z context = 2025-05-07T20:32:55.4697519Z 2025-05-07T20:32:55.4697695Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4697972Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4698096Z module_map=module_map) 2025-05-07T20:32:55.4698268Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4698376Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4698469Z E ^ 2025-05-07T20:32:55.4698925Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4698931Z 2025-05-07T20:32:55.4699370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4699380Z 2025-05-07T20:32:55.4699490Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4699725Z self=, 2025-05-07T20:32:55.4699818Z T=128, 2025-05-07T20:32:55.4699901Z D=5120, 2025-05-07T20:32:55.4699989Z scale_ub=None, 2025-05-07T20:32:55.4700091Z contiguous=False, 2025-05-07T20:32:55.4700182Z compiled=True, 2025-05-07T20:32:55.4700272Z ) 2025-05-07T20:32:55.4700502Z self = 2025-05-07T20:32:55.4700723Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:55.4700727Z 2025-05-07T20:32:55.4700816Z @given( 2025-05-07T20:32:55.4700948Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4701056Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4701184Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4701376Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4701497Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4701582Z ) 2025-05-07T20:32:55.4701842Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4701951Z def test_silu_mul_quant( 2025-05-07T20:32:55.4702033Z self, 2025-05-07T20:32:55.4702114Z T: int, 2025-05-07T20:32:55.4702201Z D: int, 2025-05-07T20:32:55.4702306Z scale_ub: Optional[float], 2025-05-07T20:32:55.4702402Z contiguous: bool, 2025-05-07T20:32:55.4702504Z compiled: bool, 2025-05-07T20:32:55.4702588Z ) -> None: 2025-05-07T20:32:55.4702690Z torch.manual_seed(2025) 2025-05-07T20:32:55.4702776Z 2025-05-07T20:32:55.4702959Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4703038Z 2025-05-07T20:32:55.4703141Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4703273Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4703379Z x = x_sign * x_clamp 2025-05-07T20:32:55.4703463Z x0 = x[:, :D] 2025-05-07T20:32:55.4703548Z x1 = x[:, D:] 2025-05-07T20:32:55.4703631Z 2025-05-07T20:32:55.4703720Z if contiguous: 2025-05-07T20:32:55.4703816Z x0 = x0.contiguous() 2025-05-07T20:32:55.4703917Z x1 = x1.contiguous() 2025-05-07T20:32:55.4703994Z 2025-05-07T20:32:55.4704092Z if scale_ub is not None: 2025-05-07T20:32:55.4704213Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4704360Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4704440Z ) 2025-05-07T20:32:55.4704528Z else: 2025-05-07T20:32:55.4704633Z scale_ub_tensor = None 2025-05-07T20:32:55.4704711Z 2025-05-07T20:32:55.4704855Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4704951Z op = silu_mul_quant 2025-05-07T20:32:55.4705050Z if compiled: 2025-05-07T20:32:55.4705156Z op = torch.compile(op) 2025-05-07T20:32:55.4705268Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4705350Z 2025-05-07T20:32:55.4705446Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4705451Z 2025-05-07T20:32:55.4705555Z moe/activation_test.py:117: 2025-05-07T20:32:55.4705698Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4705806Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4705912Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4706309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.4706409Z return fn(*args, **kwargs) 
2025-05-07T20:32:55.4707016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4707123Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4707504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4707745Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4708100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4708207Z kernel = self.compile( 2025-05-07T20:32:55.4708606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4708833Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4708977Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4708981Z 2025-05-07T20:32:55.4709197Z self = 2025-05-07T20:32:55.4710022Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4710598Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff08fce8c0>} 2025-05-07T20:32:55.4711373Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4716975Z context = 2025-05-07T20:32:55.4716986Z 2025-05-07T20:32:55.4717197Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4717488Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4717606Z module_map=module_map) 2025-05-07T20:32:55.4717787Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4717904Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4717989Z E ^ 2025-05-07T20:32:55.4718371Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4718376Z 2025-05-07T20:32:55.4718822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4718832Z 2025-05-07T20:32:55.4718944Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4719190Z self=, 2025-05-07T20:32:55.4719277Z T=128, 2025-05-07T20:32:55.4719360Z D=7168, 2025-05-07T20:32:55.4719456Z scale_ub=1200.0, 2025-05-07T20:32:55.4719551Z contiguous=False, 2025-05-07T20:32:55.4719641Z compiled=False, 2025-05-07T20:32:55.4719733Z ) 2025-05-07T20:32:55.4719963Z self = 2025-05-07T20:32:55.4720147Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:55.4720161Z 2025-05-07T20:32:55.4720246Z @given( 2025-05-07T20:32:55.4720376Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4720492Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4720616Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4720742Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4720872Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4720952Z ) 2025-05-07T20:32:55.4721337Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4721450Z def test_silu_mul_quant( 2025-05-07T20:32:55.4721534Z self, 2025-05-07T20:32:55.4721618Z T: int, 2025-05-07T20:32:55.4721705Z D: int, 2025-05-07T20:32:55.4721815Z scale_ub: Optional[float], 2025-05-07T20:32:55.4721916Z contiguous: bool, 2025-05-07T20:32:55.4722010Z compiled: bool, 2025-05-07T20:32:55.4722096Z ) -> None: 2025-05-07T20:32:55.4722203Z torch.manual_seed(2025) 2025-05-07T20:32:55.4722284Z 2025-05-07T20:32:55.4722463Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4722550Z 2025-05-07T20:32:55.4722649Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4722782Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4722931Z x = x_sign * x_clamp 2025-05-07T20:32:55.4723018Z x0 = x[:, :D] 2025-05-07T20:32:55.4723105Z x1 = x[:, D:] 2025-05-07T20:32:55.4723190Z 2025-05-07T20:32:55.4723285Z if contiguous: 2025-05-07T20:32:55.4723391Z x0 = x0.contiguous() 2025-05-07T20:32:55.4723490Z x1 = x1.contiguous() 2025-05-07T20:32:55.4723568Z 2025-05-07T20:32:55.4723673Z if scale_ub is not None: 2025-05-07T20:32:55.4724114Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4724323Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4724451Z ) 2025-05-07T20:32:55.4724543Z else: 2025-05-07T20:32:55.4724645Z scale_ub_tensor = None 2025-05-07T20:32:55.4724732Z 2025-05-07T20:32:55.4724871Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4724968Z op = silu_mul_quant 2025-05-07T20:32:55.4725073Z if compiled: 2025-05-07T20:32:55.4725185Z op = torch.compile(op) 2025-05-07T20:32:55.4725308Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4725386Z 2025-05-07T20:32:55.4725488Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4725493Z 2025-05-07T20:32:55.4725603Z moe/activation_test.py:117: 2025-05-07T20:32:55.4725744Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4725857Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4725973Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4726495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4726601Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4726986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4727221Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4727590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4727691Z kernel = self.compile( 2025-05-07T20:32:55.4728099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4728291Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4728429Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4728434Z 2025-05-07T20:32:55.4728657Z self = 2025-05-07T20:32:55.4729469Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4730003Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff08d2aef0>} 2025-05-07T20:32:55.4731057Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4731264Z context = 2025-05-07T20:32:55.4731271Z 2025-05-07T20:32:55.4731455Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4731733Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4731850Z module_map=module_map) 2025-05-07T20:32:55.4732029Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4732135Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4732227Z E ^ 2025-05-07T20:32:55.4732665Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4732670Z 2025-05-07T20:32:55.4733108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4733113Z 2025-05-07T20:32:55.4733234Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4733535Z self=, 2025-05-07T20:32:55.4733626Z T=128, 2025-05-07T20:32:55.4733712Z D=5120, 2025-05-07T20:32:55.4733801Z scale_ub=None, 2025-05-07T20:32:55.4733905Z contiguous=False, 2025-05-07T20:32:55.4733998Z compiled=False, 2025-05-07T20:32:55.4734077Z ) 2025-05-07T20:32:55.4734314Z self = 2025-05-07T20:32:55.4734498Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:55.4734506Z 2025-05-07T20:32:55.4734588Z @given( 2025-05-07T20:32:55.4734723Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4734831Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4734959Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4735090Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4735213Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4735305Z ) 2025-05-07T20:32:55.4735565Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4735667Z def test_silu_mul_quant( 2025-05-07T20:32:55.4735755Z self, 2025-05-07T20:32:55.4735838Z T: int, 2025-05-07T20:32:55.4735920Z D: int, 2025-05-07T20:32:55.4736030Z scale_ub: Optional[float], 2025-05-07T20:32:55.4736126Z contiguous: bool, 2025-05-07T20:32:55.4736218Z compiled: bool, 2025-05-07T20:32:55.4736311Z ) -> None: 2025-05-07T20:32:55.4736415Z torch.manual_seed(2025) 2025-05-07T20:32:55.4736492Z 2025-05-07T20:32:55.4736678Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4736760Z 2025-05-07T20:32:55.4736869Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4737003Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4737098Z x = x_sign * x_clamp 2025-05-07T20:32:55.4737191Z x0 = x[:, :D] 2025-05-07T20:32:55.4737281Z x1 = x[:, D:] 2025-05-07T20:32:55.4737360Z 2025-05-07T20:32:55.4737455Z if contiguous: 2025-05-07T20:32:55.4737553Z x0 = x0.contiguous() 2025-05-07T20:32:55.4737649Z x1 = x1.contiguous() 2025-05-07T20:32:55.4737733Z 2025-05-07T20:32:55.4737830Z if scale_ub is not None: 2025-05-07T20:32:55.4737944Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4738093Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4738174Z ) 2025-05-07T20:32:55.4738267Z else: 2025-05-07T20:32:55.4738368Z scale_ub_tensor = None 2025-05-07T20:32:55.4738446Z 2025-05-07T20:32:55.4738677Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4738776Z op = silu_mul_quant 2025-05-07T20:32:55.4738867Z if compiled: 2025-05-07T20:32:55.4738980Z op = torch.compile(op) 2025-05-07T20:32:55.4739098Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4739179Z 2025-05-07T20:32:55.4739284Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4739288Z 2025-05-07T20:32:55.4739394Z moe/activation_test.py:117: 2025-05-07T20:32:55.4739536Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4739644Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4739749Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4740277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4740465Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4740847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4741090Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4741448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4741599Z kernel = self.compile( 2025-05-07T20:32:55.4742000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4742185Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4742324Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4742330Z 2025-05-07T20:32:55.4742544Z self = 2025-05-07T20:32:55.4743370Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4743902Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff08fccb80>} 2025-05-07T20:32:55.4744682Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4744895Z context = 2025-05-07T20:32:55.4744900Z 2025-05-07T20:32:55.4745074Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4745356Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4745473Z module_map=module_map) 2025-05-07T20:32:55.4745650Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4745763Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4745846Z E ^ 2025-05-07T20:32:55.4746219Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4746233Z 2025-05-07T20:32:55.4746664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4746669Z 2025-05-07T20:32:55.4746780Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4747019Z self=, 2025-05-07T20:32:55.4747102Z T=128, 2025-05-07T20:32:55.4747184Z D=5120, 2025-05-07T20:32:55.4747280Z scale_ub=1200.0, 2025-05-07T20:32:55.4747372Z contiguous=True, 2025-05-07T20:32:55.4747462Z compiled=False, 2025-05-07T20:32:55.4747548Z ) 2025-05-07T20:32:55.4747859Z self = 2025-05-07T20:32:55.4748048Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:55.4748053Z 2025-05-07T20:32:55.4748134Z @given( 2025-05-07T20:32:55.4748261Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4748382Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4748507Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4748632Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4748760Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4748844Z ) 2025-05-07T20:32:55.4749103Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4749213Z def test_silu_mul_quant( 2025-05-07T20:32:55.4749296Z self, 2025-05-07T20:32:55.4749429Z T: int, 2025-05-07T20:32:55.4749512Z D: int, 2025-05-07T20:32:55.4749617Z scale_ub: Optional[float], 2025-05-07T20:32:55.4749728Z contiguous: bool, 2025-05-07T20:32:55.4749820Z compiled: bool, 2025-05-07T20:32:55.4749908Z ) -> None: 2025-05-07T20:32:55.4750015Z torch.manual_seed(2025) 2025-05-07T20:32:55.4750093Z 2025-05-07T20:32:55.4750318Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4750403Z 2025-05-07T20:32:55.4750501Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4750635Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4750740Z x = x_sign * x_clamp 2025-05-07T20:32:55.4750826Z x0 = x[:, :D] 2025-05-07T20:32:55.4750922Z x1 = x[:, D:] 2025-05-07T20:32:55.4750999Z 2025-05-07T20:32:55.4751089Z if contiguous: 2025-05-07T20:32:55.4751193Z x0 = x0.contiguous() 2025-05-07T20:32:55.4751292Z x1 = x1.contiguous() 2025-05-07T20:32:55.4751369Z 2025-05-07T20:32:55.4751476Z if scale_ub is not None: 2025-05-07T20:32:55.4751589Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4751738Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4751826Z ) 2025-05-07T20:32:55.4751906Z else: 2025-05-07T20:32:55.4752006Z scale_ub_tensor = None 2025-05-07T20:32:55.4752095Z 2025-05-07T20:32:55.4752232Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4752329Z op = silu_mul_quant 2025-05-07T20:32:55.4752426Z if compiled: 2025-05-07T20:32:55.4752532Z op = torch.compile(op) 2025-05-07T20:32:55.4752652Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4752730Z 2025-05-07T20:32:55.4752826Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4752830Z 2025-05-07T20:32:55.4752937Z moe/activation_test.py:117: 2025-05-07T20:32:55.4753077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4753184Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4753301Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4753924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4754035Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4754416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4754651Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4755015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4755116Z kernel = self.compile( 2025-05-07T20:32:55.4755517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4755711Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4756017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4756022Z 2025-05-07T20:32:55.4756249Z self = 2025-05-07T20:32:55.4757057Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4757590Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7eff08fcff40>} 2025-05-07T20:32:55.4758372Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4758616Z context = 2025-05-07T20:32:55.4758621Z 2025-05-07T20:32:55.4758807Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4759085Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4759249Z module_map=module_map) 2025-05-07T20:32:55.4759422Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4759527Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4759618Z E ^ 2025-05-07T20:32:55.4759989Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4759994Z 2025-05-07T20:32:55.4760425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4760432Z 2025-05-07T20:32:55.4760549Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4760783Z self=, 2025-05-07T20:32:55.4760876Z T=1, 2025-05-07T20:32:55.4760965Z D=7168, 2025-05-07T20:32:55.4761054Z scale_ub=1200.0, 2025-05-07T20:32:55.4761150Z contiguous=True, 2025-05-07T20:32:55.4761240Z compiled=True, 2025-05-07T20:32:55.4761322Z ) 2025-05-07T20:32:55.4761557Z self = 2025-05-07T20:32:55.4761733Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:55.4761737Z 2025-05-07T20:32:55.4761819Z @given( 2025-05-07T20:32:55.4761951Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4762057Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4762187Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4762311Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4762436Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4762524Z ) 2025-05-07T20:32:55.4762788Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4762888Z def test_silu_mul_quant( 2025-05-07T20:32:55.4762977Z self, 2025-05-07T20:32:55.4763060Z T: int, 2025-05-07T20:32:55.4763142Z D: int, 2025-05-07T20:32:55.4763262Z scale_ub: Optional[float], 2025-05-07T20:32:55.4763358Z contiguous: bool, 2025-05-07T20:32:55.4763450Z compiled: bool, 2025-05-07T20:32:55.4763541Z ) -> None: 2025-05-07T20:32:55.4763642Z torch.manual_seed(2025) 2025-05-07T20:32:55.4763729Z 2025-05-07T20:32:55.4763907Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4763987Z 2025-05-07T20:32:55.4764093Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4764227Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4764324Z x = x_sign * x_clamp 2025-05-07T20:32:55.4764419Z x0 = x[:, :D] 2025-05-07T20:32:55.4764506Z x1 = x[:, D:] 2025-05-07T20:32:55.4764584Z 2025-05-07T20:32:55.4764765Z if contiguous: 2025-05-07T20:32:55.4764864Z x0 = x0.contiguous() 2025-05-07T20:32:55.4764959Z x1 = x1.contiguous() 2025-05-07T20:32:55.4765045Z 2025-05-07T20:32:55.4765149Z if scale_ub is not None: 2025-05-07T20:32:55.4765273Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4765417Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4765501Z ) 2025-05-07T20:32:55.4765591Z else: 2025-05-07T20:32:55.4765692Z scale_ub_tensor = None 2025-05-07T20:32:55.4765771Z 2025-05-07T20:32:55.4765914Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4766014Z op = silu_mul_quant 2025-05-07T20:32:55.4766106Z if compiled: 2025-05-07T20:32:55.4766262Z op = torch.compile(op) 2025-05-07T20:32:55.4766375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4766453Z 2025-05-07T20:32:55.4766564Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4766569Z 2025-05-07T20:32:55.4766672Z moe/activation_test.py:117: 2025-05-07T20:32:55.4766813Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4766964Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4767069Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4767460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.4767564Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
    ... (same Triton jit/compile frames as above) ...
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
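The ValueError is raised while Triton lowers the kernel AST (ast_to_ttir): fp8e4nv is Triton's name for the float8_e4m3fn element type, which its NVIDIA backend only lowers on compute capability 8.9 and newer (Ada/Hopper). On this runner's older GPU only fp8e4b15 and fp8e5 are available, exactly as the message says. A minimal sketch of a capability guard, using only standard PyTorch APIs (this helper is illustrative and not part of the FBGEMM test suite):

    import torch

    def supports_fp8e4nv() -> bool:
        """True if the current CUDA device can compile Triton fp8e4nv kernels."""
        if not torch.cuda.is_available():
            return False
        # float8_e4m3fn (Triton fp8e4nv) needs SM 8.9+ (Ada/Hopper).
        return torch.cuda.get_device_capability() >= (8, 9)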
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

The test body is identical to the example above; this time fn() completed, and the failure moved to the eager reference path:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
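For context on what the failing reference path computes: triton_quantize_fp8_row returns a row-wise fp8 tensor plus a per-row scale, which the test dequantizes as y_fp8.to(torch.float32) * y_scale[:, None]. A plain-PyTorch sketch of that contract, under assumptions flagged in the comments (this is not FBGEMM's actual kernel; the FP8_MAX constant and scale_ub handling are one reading of a row-wise scheme):

    from typing import Optional, Tuple
    import torch

    FP8_MAX = 448.0  # assumed: max normal value of float8_e4m3fn

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max magnitude, kept away from zero for a stable divide.
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        # Assumed: scale_ub caps the per-row max before the scale is derived.
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_MAX  # per-row dequantization factor
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Under this sketch, y_fp8.to(torch.float32) * y_scale[:, None] approximately reconstructs y, which is what the test's comparison relies on.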
Hypothesis continues sampling examples. Every one of the following fails at `y_fp8, y_scale = fn()` (moe/activation_test.py:117) with the identical traceback through silu_mul_quant (routed through torch/_dynamo/eval_frame.py:678 when compiled=True) and the same error:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
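Nothing here is specific to the FBGEMM kernels: any Triton kernel that materializes an fp8e4nv value trips the same check in make_ir. A minimal, hypothetical repro (kernel name, shapes, and launch parameters are illustrative only, assuming a Triton version that exposes tl.float8e4nv):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # This cast is what ast_to_ttir rejects on pre-SM 8.9 GPUs.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(128, device="cuda")
    y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
    # Expected on this runner: CompilationError wrapping the same ValueError.
    _cast_to_fp8e4nv[(1,)](x, y, 128, BLOCK=128)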
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4881557Z 2025-05-07T20:32:55.4881996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4882003Z 2025-05-07T20:32:55.4882114Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4882350Z self=, 2025-05-07T20:32:55.4882440Z T=4096, 2025-05-07T20:32:55.4882522Z D=7168, 2025-05-07T20:32:55.4882613Z scale_ub=1200.0, 2025-05-07T20:32:55.4882714Z contiguous=False, 2025-05-07T20:32:55.4882805Z compiled=False, 2025-05-07T20:32:55.4882886Z ) 2025-05-07T20:32:55.4883121Z self = 2025-05-07T20:32:55.4883311Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:55.4883316Z 2025-05-07T20:32:55.4883405Z @given( 2025-05-07T20:32:55.4883617Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4883728Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4883863Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4883989Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4884115Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4884201Z ) 2025-05-07T20:32:55.4884461Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4884565Z def test_silu_mul_quant( 2025-05-07T20:32:55.4884655Z self, 2025-05-07T20:32:55.4884738Z T: int, 2025-05-07T20:32:55.4884831Z D: int, 2025-05-07T20:32:55.4884936Z scale_ub: Optional[float], 2025-05-07T20:32:55.4885032Z contiguous: bool, 2025-05-07T20:32:55.4885174Z compiled: bool, 2025-05-07T20:32:55.4885260Z ) -> None: 2025-05-07T20:32:55.4885361Z torch.manual_seed(2025) 2025-05-07T20:32:55.4885448Z 2025-05-07T20:32:55.4885631Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4885711Z 2025-05-07T20:32:55.4885816Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4885953Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4886090Z x = x_sign * x_clamp 2025-05-07T20:32:55.4886184Z x0 = x[:, :D] 2025-05-07T20:32:55.4886270Z x1 = x[:, D:] 2025-05-07T20:32:55.4886349Z 2025-05-07T20:32:55.4886446Z if contiguous: 2025-05-07T20:32:55.4886543Z x0 = x0.contiguous() 2025-05-07T20:32:55.4886647Z x1 = x1.contiguous() 2025-05-07T20:32:55.4886727Z 2025-05-07T20:32:55.4886824Z if scale_ub is not None: 2025-05-07T20:32:55.4886943Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4887090Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4887172Z ) 2025-05-07T20:32:55.4887260Z else: 2025-05-07T20:32:55.4887366Z scale_ub_tensor = None 2025-05-07T20:32:55.4887446Z 2025-05-07T20:32:55.4887589Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4887686Z op = silu_mul_quant 2025-05-07T20:32:55.4887777Z if compiled: 2025-05-07T20:32:55.4887892Z op = torch.compile(op) 2025-05-07T20:32:55.4888005Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4888089Z 2025-05-07T20:32:55.4888184Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4888189Z 2025-05-07T20:32:55.4888292Z moe/activation_test.py:117: 2025-05-07T20:32:55.4888433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4888541Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4888648Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4889179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:55.4889286Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4889669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4889906Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4890266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4890373Z kernel = self.compile( 2025-05-07T20:32:55.4890776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4890962Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4891102Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4891110Z 2025-05-07T20:32:55.4891327Z self = 2025-05-07T20:32:55.4892227Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4892766Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf7a08ca0>} 2025-05-07T20:32:55.4893548Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4893752Z context = 2025-05-07T20:32:55.4893757Z 2025-05-07T20:32:55.4893933Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4894258Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4894378Z module_map=module_map) 2025-05-07T20:32:55.4894557Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4894663Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4894885Z E ^ 2025-05-07T20:32:55.4895266Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4895271Z 2025-05-07T20:32:55.4895704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4895708Z 2025-05-07T20:32:55.4895818Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4896061Z self=, 2025-05-07T20:32:55.4896149Z T=16384, 2025-05-07T20:32:55.4896240Z D=7168, 2025-05-07T20:32:55.4896328Z scale_ub=None, 2025-05-07T20:32:55.4896419Z contiguous=True, 2025-05-07T20:32:55.4896513Z compiled=True, 2025-05-07T20:32:55.4896595Z ) 2025-05-07T20:32:55.4896825Z self = 2025-05-07T20:32:55.4897014Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:55.4897022Z 2025-05-07T20:32:55.4897104Z @given( 2025-05-07T20:32:55.4897232Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4897347Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4897470Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4897603Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4897725Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4897806Z ) 2025-05-07T20:32:55.4898073Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4898176Z def test_silu_mul_quant( 2025-05-07T20:32:55.4898259Z self, 2025-05-07T20:32:55.4898350Z T: int, 2025-05-07T20:32:55.4898436Z D: int, 2025-05-07T20:32:55.4898541Z scale_ub: Optional[float], 2025-05-07T20:32:55.4898645Z contiguous: bool, 2025-05-07T20:32:55.4898741Z compiled: bool, 2025-05-07T20:32:55.4898825Z ) -> None: 2025-05-07T20:32:55.4898938Z torch.manual_seed(2025) 2025-05-07T20:32:55.4899016Z 2025-05-07T20:32:55.4899199Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4899280Z 2025-05-07T20:32:55.4899378Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4899522Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4899618Z x = x_sign * x_clamp 2025-05-07T20:32:55.4899704Z x0 = x[:, :D] 2025-05-07T20:32:55.4899798Z x1 = x[:, D:] 2025-05-07T20:32:55.4899877Z 2025-05-07T20:32:55.4899970Z if contiguous: 2025-05-07T20:32:55.4900075Z x0 = x0.contiguous() 2025-05-07T20:32:55.4900172Z x1 = x1.contiguous() 2025-05-07T20:32:55.4900250Z 2025-05-07T20:32:55.4900439Z if scale_ub is not None: 2025-05-07T20:32:55.4900554Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4900704Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4900788Z ) 2025-05-07T20:32:55.4900871Z else: 2025-05-07T20:32:55.4900977Z scale_ub_tensor = None 2025-05-07T20:32:55.4901055Z 2025-05-07T20:32:55.4901193Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4901297Z op = silu_mul_quant 2025-05-07T20:32:55.4901389Z if compiled: 2025-05-07T20:32:55.4901495Z op = torch.compile(op) 2025-05-07T20:32:55.4901615Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4901694Z 2025-05-07T20:32:55.4901790Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4901838Z 2025-05-07T20:32:55.4901949Z moe/activation_test.py:117: 2025-05-07T20:32:55.4902085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4902210Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4902316Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4902702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.4902853Z return fn(*args, **kwargs) 
2025-05-07T20:32:55.4903369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4903473Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4903854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4904090Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4904456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4904556Z kernel = self.compile( 2025-05-07T20:32:55.4904962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4905155Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4905291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4905296Z 2025-05-07T20:32:55.4905518Z self = 2025-05-07T20:32:55.4906326Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4906857Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf7a09b40>} 2025-05-07T20:32:55.4907644Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4907849Z context = 2025-05-07T20:32:55.4907856Z 2025-05-07T20:32:55.4908037Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4908317Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4908433Z module_map=module_map) 2025-05-07T20:32:55.4908612Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4908720Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4908808Z E ^ 2025-05-07T20:32:55.4909188Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4909193Z 2025-05-07T20:32:55.4909741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4909746Z 2025-05-07T20:32:55.4909865Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4910103Z self=, 2025-05-07T20:32:55.4910186Z T=4096, 2025-05-07T20:32:55.4910273Z D=5120, 2025-05-07T20:32:55.4910364Z scale_ub=None, 2025-05-07T20:32:55.4910461Z contiguous=False, 2025-05-07T20:32:55.4910549Z compiled=True, 2025-05-07T20:32:55.4910627Z ) 2025-05-07T20:32:55.4910860Z self = 2025-05-07T20:32:55.4911044Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:55.4911048Z 2025-05-07T20:32:55.4911172Z @given( 2025-05-07T20:32:55.4911306Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4911413Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4911540Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4911675Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4911798Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4911924Z ) 2025-05-07T20:32:55.4912189Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4912290Z def test_silu_mul_quant( 2025-05-07T20:32:55.4912378Z self, 2025-05-07T20:32:55.4912462Z T: int, 2025-05-07T20:32:55.4912544Z D: int, 2025-05-07T20:32:55.4912655Z scale_ub: Optional[float], 2025-05-07T20:32:55.4912751Z contiguous: bool, 2025-05-07T20:32:55.4912842Z compiled: bool, 2025-05-07T20:32:55.4912931Z ) -> None: 2025-05-07T20:32:55.4913034Z torch.manual_seed(2025) 2025-05-07T20:32:55.4913117Z 2025-05-07T20:32:55.4913304Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4913384Z 2025-05-07T20:32:55.4913495Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4913753Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4913849Z x = x_sign * x_clamp 2025-05-07T20:32:55.4913941Z x0 = x[:, :D] 2025-05-07T20:32:55.4914046Z x1 = x[:, D:] 2025-05-07T20:32:55.4914123Z 2025-05-07T20:32:55.4914219Z if contiguous: 2025-05-07T20:32:55.4914317Z x0 = x0.contiguous() 2025-05-07T20:32:55.4914412Z x1 = x1.contiguous() 2025-05-07T20:32:55.4914499Z 2025-05-07T20:32:55.4914597Z if scale_ub is not None: 2025-05-07T20:32:55.4914710Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4914860Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4914941Z ) 2025-05-07T20:32:55.4915034Z else: 2025-05-07T20:32:55.4915135Z scale_ub_tensor = None 2025-05-07T20:32:55.4915214Z 2025-05-07T20:32:55.4915358Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4915461Z op = silu_mul_quant 2025-05-07T20:32:55.4915555Z if compiled: 2025-05-07T20:32:55.4915667Z op = torch.compile(op) 2025-05-07T20:32:55.4915779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4915861Z 2025-05-07T20:32:55.4915964Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4915969Z 2025-05-07T20:32:55.4916073Z moe/activation_test.py:117: 2025-05-07T20:32:55.4916215Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4916323Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4916428Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4916824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.4916926Z return fn(*args, **kwargs) 
2025-05-07T20:32:55.4917540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7efdf7a09240>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>, options = CUDAOptions(...), codegen_fns = {...}, module_map = {...}, context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
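Every failure in this stretch of the log is the same root cause surfacing through different Hypothesis examples: silu_mul_quant launches the _fbgemm_silu_mul_quant Triton kernel, which produces an fp8e4nv (e4m3) output, and the GPU on this runner has no lowering for that dtype, so compilation is rejected in ast_to_ttir before the kernel ever runs; only fp8e4b15 and fp8e5 are available here. A minimal sketch of the failing call path with the Hypothesis harness stripped away (assumes the fbgemm_gpu gen_ai package built by this job is importable and a CUDA device is visible; the shapes are illustrative):

    # Repro sketch, not part of the test suite: drives the same
    # silu_mul_quant -> _fbgemm_silu_mul_quant path that fails above.
    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120  # any of the sampled sizes reproduces the error
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D], x[:, D:]
    scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)

    # On a GPU without fp8e4nv support this raises
    # triton.compiler.errors.CompilationError at kernel-compile time.
    y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub)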
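To keep a job like this green on such hardware, one option is to gate the FP8 tests on compute capability up front rather than letting every example die inside the Triton compiler. A minimal sketch, assuming fp8e4nv requires an SM 8.9 or newer GPU; the helper and decorator names are illustrative, not existing FBGEMM test utilities:

    import unittest

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Hypothetical guard: Triton only lowers fp8e4nv on SM 8.9+ parts.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator for tests like test_silu_mul_quant above.
    skip_unless_fp8e4nv = unittest.skipUnless(
        gpu_supports_fp8e4nv(),
        "Triton on this GPU supports only ('fp8e4b15', 'fp8e5')",
    )

Applied as @skip_unless_fp8e4nv, the examples below would be reported as skips instead of repeating the identical CompilationError for every parameter combination.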
2025-05-07T20:32:55.4945731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4945883Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4946267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4946505Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4946914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4947015Z kernel = self.compile( 2025-05-07T20:32:55.4947418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4947611Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4947744Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4947749Z 2025-05-07T20:32:55.4947976Z self = 2025-05-07T20:32:55.4948789Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4949329Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf7a0ab90>} 2025-05-07T20:32:55.4950113Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4950316Z context = 2025-05-07T20:32:55.4950321Z 2025-05-07T20:32:55.4950504Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4950785Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4950910Z module_map=module_map) 2025-05-07T20:32:55.4951085Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4951192Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4951285Z E ^ 2025-05-07T20:32:55.4951661Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4951668Z 2025-05-07T20:32:55.4952102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4952112Z 2025-05-07T20:32:55.4952222Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4952456Z self=, 2025-05-07T20:32:55.4952546Z T=2048, 2025-05-07T20:32:55.4952628Z D=7168, 2025-05-07T20:32:55.4952721Z scale_ub=1200.0, 2025-05-07T20:32:55.4952822Z contiguous=False, 2025-05-07T20:32:55.4952913Z compiled=False, 2025-05-07T20:32:55.4952991Z ) 2025-05-07T20:32:55.4953313Z self = 2025-05-07T20:32:55.4953619Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:55.4953624Z 2025-05-07T20:32:55.4953719Z @given( 2025-05-07T20:32:55.4953845Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4953951Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4954080Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4954204Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4954326Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4954411Z ) 2025-05-07T20:32:55.4954670Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4954827Z def test_silu_mul_quant( 2025-05-07T20:32:55.4954917Z self, 2025-05-07T20:32:55.4955001Z T: int, 2025-05-07T20:32:55.4955082Z D: int, 2025-05-07T20:32:55.4955200Z scale_ub: Optional[float], 2025-05-07T20:32:55.4955295Z contiguous: bool, 2025-05-07T20:32:55.4955392Z compiled: bool, 2025-05-07T20:32:55.4955477Z ) -> None: 2025-05-07T20:32:55.4955577Z torch.manual_seed(2025) 2025-05-07T20:32:55.4955708Z 2025-05-07T20:32:55.4955887Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4955966Z 2025-05-07T20:32:55.4956069Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4956202Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4956297Z x = x_sign * x_clamp 2025-05-07T20:32:55.4956389Z x0 = x[:, :D] 2025-05-07T20:32:55.4956475Z x1 = x[:, D:] 2025-05-07T20:32:55.4956553Z 2025-05-07T20:32:55.4956647Z if contiguous: 2025-05-07T20:32:55.4956749Z x0 = x0.contiguous() 2025-05-07T20:32:55.4956851Z x1 = x1.contiguous() 2025-05-07T20:32:55.4956931Z 2025-05-07T20:32:55.4957029Z if scale_ub is not None: 2025-05-07T20:32:55.4957153Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4957298Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4957378Z ) 2025-05-07T20:32:55.4957470Z else: 2025-05-07T20:32:55.4957570Z scale_ub_tensor = None 2025-05-07T20:32:55.4957649Z 2025-05-07T20:32:55.4957792Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4957890Z op = silu_mul_quant 2025-05-07T20:32:55.4957984Z if compiled: 2025-05-07T20:32:55.4958096Z op = torch.compile(op) 2025-05-07T20:32:55.4958210Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4958294Z 2025-05-07T20:32:55.4958391Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4958395Z 2025-05-07T20:32:55.4958502Z moe/activation_test.py:117: 2025-05-07T20:32:55.4958646Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4958758Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4958865Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4959399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:55.4959507Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4959886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4960128Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4960487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4960595Z kernel = self.compile( 2025-05-07T20:32:55.4960998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4961187Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4961414Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4961419Z 2025-05-07T20:32:55.4961638Z self = 2025-05-07T20:32:55.4962460Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4962991Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf75285e0>} 2025-05-07T20:32:55.4963777Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4964029Z context = 2025-05-07T20:32:55.4964033Z 2025-05-07T20:32:55.4964210Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4964497Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4964659Z module_map=module_map) 2025-05-07T20:32:55.4964833Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4964948Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4965031Z E ^ 2025-05-07T20:32:55.4965412Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4965416Z 2025-05-07T20:32:55.4965852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4965861Z 2025-05-07T20:32:55.4965972Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4966218Z self=, 2025-05-07T20:32:55.4966302Z T=1, 2025-05-07T20:32:55.4966391Z D=7168, 2025-05-07T20:32:55.4966479Z scale_ub=None, 2025-05-07T20:32:55.4966571Z contiguous=True, 2025-05-07T20:32:55.4966673Z compiled=False, 2025-05-07T20:32:55.4966752Z ) 2025-05-07T20:32:55.4966982Z self = 2025-05-07T20:32:55.4967164Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:55.4967168Z 2025-05-07T20:32:55.4967251Z @given( 2025-05-07T20:32:55.4967379Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4967493Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4967617Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4967752Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4967875Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4967959Z ) 2025-05-07T20:32:55.4968230Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4968332Z def test_silu_mul_quant( 2025-05-07T20:32:55.4968415Z self, 2025-05-07T20:32:55.4968507Z T: int, 2025-05-07T20:32:55.4968590Z D: int, 2025-05-07T20:32:55.4968694Z scale_ub: Optional[float], 2025-05-07T20:32:55.4968797Z contiguous: bool, 2025-05-07T20:32:55.4968890Z compiled: bool, 2025-05-07T20:32:55.4968973Z ) -> None: 2025-05-07T20:32:55.4969080Z torch.manual_seed(2025) 2025-05-07T20:32:55.4969162Z 2025-05-07T20:32:55.4969339Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4969424Z 2025-05-07T20:32:55.4969523Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4969665Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4969760Z x = x_sign * x_clamp 2025-05-07T20:32:55.4969845Z x0 = x[:, :D] 2025-05-07T20:32:55.4970021Z x1 = x[:, D:] 2025-05-07T20:32:55.4970101Z 2025-05-07T20:32:55.4970190Z if contiguous: 2025-05-07T20:32:55.4970291Z x0 = x0.contiguous() 2025-05-07T20:32:55.4970386Z x1 = x1.contiguous() 2025-05-07T20:32:55.4970466Z 2025-05-07T20:32:55.4970569Z if scale_ub is not None: 2025-05-07T20:32:55.4970685Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4970829Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4970915Z ) 2025-05-07T20:32:55.4970998Z else: 2025-05-07T20:32:55.4971103Z scale_ub_tensor = None 2025-05-07T20:32:55.4971181Z 2025-05-07T20:32:55.4971318Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4971421Z op = silu_mul_quant 2025-05-07T20:32:55.4971555Z if compiled: 2025-05-07T20:32:55.4971661Z op = torch.compile(op) 2025-05-07T20:32:55.4971782Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4971867Z 2025-05-07T20:32:55.4971963Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4971968Z 2025-05-07T20:32:55.4972079Z moe/activation_test.py:117: 2025-05-07T20:32:55.4972217Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4972410Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4972517Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4973043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4973154Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4973530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4973771Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4974140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4974240Z kernel = self.compile( 2025-05-07T20:32:55.4974650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4974842Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4974976Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4974981Z 2025-05-07T20:32:55.4975204Z self = 2025-05-07T20:32:55.4976019Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4976563Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf7528d30>} 2025-05-07T20:32:55.4977342Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4977548Z context = 2025-05-07T20:32:55.4977559Z 2025-05-07T20:32:55.4977734Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4978011Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4978131Z module_map=module_map) 2025-05-07T20:32:55.4978302Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4978409Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4978502Z E ^ 2025-05-07T20:32:55.4978961Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4978967Z 2025-05-07T20:32:55.4979408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4979413Z 2025-05-07T20:32:55.4979530Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4979764Z self=, 2025-05-07T20:32:55.4979873Z T=16384, 2025-05-07T20:32:55.4979958Z D=7168, 2025-05-07T20:32:55.4980050Z scale_ub=1200.0, 2025-05-07T20:32:55.4980149Z contiguous=False, 2025-05-07T20:32:55.4985558Z compiled=True, 2025-05-07T20:32:55.4985652Z ) 2025-05-07T20:32:55.4985900Z self = 2025-05-07T20:32:55.4986096Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:55.4986177Z 2025-05-07T20:32:55.4986271Z @given( 2025-05-07T20:32:55.4986400Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.4986514Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.4986646Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.4986771Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.4986942Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.4987030Z ) 2025-05-07T20:32:55.4987294Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.4987396Z def test_silu_mul_quant( 2025-05-07T20:32:55.4987488Z self, 2025-05-07T20:32:55.4987571Z T: int, 2025-05-07T20:32:55.4987654Z D: int, 2025-05-07T20:32:55.4987768Z scale_ub: Optional[float], 2025-05-07T20:32:55.4987864Z contiguous: bool, 2025-05-07T20:32:55.4987962Z compiled: bool, 2025-05-07T20:32:55.4988050Z ) -> None: 2025-05-07T20:32:55.4988152Z torch.manual_seed(2025) 2025-05-07T20:32:55.4988238Z 2025-05-07T20:32:55.4988423Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.4988504Z 2025-05-07T20:32:55.4988610Z x_sign = torch.sign(x) 2025-05-07T20:32:55.4988744Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.4988842Z x = x_sign * x_clamp 2025-05-07T20:32:55.4988934Z x0 = x[:, :D] 2025-05-07T20:32:55.4989019Z x1 = x[:, D:] 2025-05-07T20:32:55.4989097Z 2025-05-07T20:32:55.4989192Z if contiguous: 2025-05-07T20:32:55.4989289Z x0 = x0.contiguous() 2025-05-07T20:32:55.4989389Z x1 = x1.contiguous() 2025-05-07T20:32:55.4989468Z 2025-05-07T20:32:55.4989564Z if scale_ub is not None: 2025-05-07T20:32:55.4989687Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.4989833Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.4989915Z ) 2025-05-07T20:32:55.4990002Z else: 2025-05-07T20:32:55.4990104Z scale_ub_tensor = None 2025-05-07T20:32:55.4990182Z 2025-05-07T20:32:55.4990331Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.4990429Z op = silu_mul_quant 2025-05-07T20:32:55.4990519Z if compiled: 2025-05-07T20:32:55.4990635Z op = torch.compile(op) 2025-05-07T20:32:55.4990747Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4990832Z 2025-05-07T20:32:55.4990929Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.4990934Z 2025-05-07T20:32:55.4991038Z moe/activation_test.py:117: 2025-05-07T20:32:55.4991182Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4991290Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.4991397Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.4991802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.4991905Z return fn(*args, **kwargs) 
2025-05-07T20:32:55.4992518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.4992633Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.4993012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.4993262Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.4993805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.4993906Z kernel = self.compile( 2025-05-07T20:32:55.4994315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.4994502Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.4994690Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.4994696Z 2025-05-07T20:32:55.4994921Z self = 2025-05-07T20:32:55.4995735Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.4996319Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf7529bd0>} 2025-05-07T20:32:55.4997102Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.4997316Z context = 2025-05-07T20:32:55.4997320Z 2025-05-07T20:32:55.4997500Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.4997779Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.4997901Z module_map=module_map) 2025-05-07T20:32:55.4998078Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.4998191Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.4998274Z E ^ 2025-05-07T20:32:55.4998648Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.4998654Z 2025-05-07T20:32:55.4999091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.4999096Z 2025-05-07T20:32:55.4999206Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.4999452Z self=, 2025-05-07T20:32:55.4999534Z T=1, 2025-05-07T20:32:55.4999617Z D=7168, 2025-05-07T20:32:55.4999715Z scale_ub=None, 2025-05-07T20:32:55.4999809Z contiguous=False, 2025-05-07T20:32:55.4999899Z compiled=False, 2025-05-07T20:32:55.4999987Z ) 2025-05-07T20:32:55.5000217Z self = 2025-05-07T20:32:55.5000397Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:55.5000402Z 2025-05-07T20:32:55.5000491Z @given( 2025-05-07T20:32:55.5000619Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5000734Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5000859Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5000985Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5001114Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5001198Z ) 2025-05-07T20:32:55.5001458Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5001653Z def test_silu_mul_quant( 2025-05-07T20:32:55.5001740Z self, 2025-05-07T20:32:55.5001823Z T: int, 2025-05-07T20:32:55.5001911Z D: int, 2025-05-07T20:32:55.5002018Z scale_ub: Optional[float], 2025-05-07T20:32:55.5002117Z contiguous: bool, 2025-05-07T20:32:55.5002219Z compiled: bool, 2025-05-07T20:32:55.5002304Z ) -> None: 2025-05-07T20:32:55.5002412Z torch.manual_seed(2025) 2025-05-07T20:32:55.5002490Z 2025-05-07T20:32:55.5002671Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5002758Z 2025-05-07T20:32:55.5002857Z x_sign = torch.sign(x) 2025-05-07T20:32:55.5002992Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.5003094Z x = x_sign * x_clamp 2025-05-07T20:32:55.5003224Z x0 = x[:, :D] 2025-05-07T20:32:55.5003310Z x1 = x[:, D:] 2025-05-07T20:32:55.5003397Z 2025-05-07T20:32:55.5003487Z if contiguous: 2025-05-07T20:32:55.5003593Z x0 = x0.contiguous() 2025-05-07T20:32:55.5003694Z x1 = x1.contiguous() 2025-05-07T20:32:55.5003772Z 2025-05-07T20:32:55.5003870Z if scale_ub is not None: 2025-05-07T20:32:55.5003990Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.5004177Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.5004264Z ) 2025-05-07T20:32:55.5004346Z else: 2025-05-07T20:32:55.5004445Z scale_ub_tensor = None 2025-05-07T20:32:55.5004530Z 2025-05-07T20:32:55.5004666Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.5004761Z op = silu_mul_quant 2025-05-07T20:32:55.5004861Z if compiled: 2025-05-07T20:32:55.5004967Z op = torch.compile(op) 2025-05-07T20:32:55.5005083Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5005167Z 2025-05-07T20:32:55.5005265Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.5005270Z 2025-05-07T20:32:55.5005385Z moe/activation_test.py:117: 2025-05-07T20:32:55.5005522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5005631Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.5005745Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5006266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.5006370Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.5006752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.5006987Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.5007352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.5007456Z kernel = self.compile( 2025-05-07T20:32:55.5007861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.5008054Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.5008188Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5008196Z 2025-05-07T20:32:55.5008412Z self = 2025-05-07T20:32:55.5009226Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.5009757Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf752a050>} 2025-05-07T20:32:55.5010630Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.5010839Z context = 2025-05-07T20:32:55.5010848Z 2025-05-07T20:32:55.5011029Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.5011310Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.5011426Z module_map=module_map) 2025-05-07T20:32:55.5011603Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.5011708Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.5011791Z E ^ 2025-05-07T20:32:55.5012170Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.5012251Z 2025-05-07T20:32:55.5012689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.5012694Z 2025-05-07T20:32:55.5012813Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5013048Z self=, 2025-05-07T20:32:55.5013171Z T=2048, 2025-05-07T20:32:55.5013259Z D=7168, 2025-05-07T20:32:55.5013347Z scale_ub=None, 2025-05-07T20:32:55.5013439Z contiguous=False, 2025-05-07T20:32:55.5013536Z compiled=True, 2025-05-07T20:32:55.5013613Z ) 2025-05-07T20:32:55.5013848Z self = 2025-05-07T20:32:55.5014031Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:55.5014036Z 2025-05-07T20:32:55.5014118Z @given( 2025-05-07T20:32:55.5014256Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5014363Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5014491Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5014623Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5014743Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5014833Z ) 2025-05-07T20:32:55.5015095Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5015196Z def test_silu_mul_quant( 2025-05-07T20:32:55.5015285Z self, 2025-05-07T20:32:55.5015369Z T: int, 2025-05-07T20:32:55.5015455Z D: int, 2025-05-07T20:32:55.5015566Z scale_ub: Optional[float], 2025-05-07T20:32:55.5015663Z contiguous: bool, 2025-05-07T20:32:55.5015755Z compiled: bool, 2025-05-07T20:32:55.5015845Z ) -> None: 2025-05-07T20:32:55.5015951Z torch.manual_seed(2025) 2025-05-07T20:32:55.5016033Z 2025-05-07T20:32:55.5016217Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5016298Z 2025-05-07T20:32:55.5016397Z x_sign = torch.sign(x) 2025-05-07T20:32:55.5016540Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.5016638Z x = x_sign * x_clamp 2025-05-07T20:32:55.5016730Z x0 = x[:, :D] 2025-05-07T20:32:55.5016814Z x1 = x[:, D:] 2025-05-07T20:32:55.5016898Z 2025-05-07T20:32:55.5016998Z if contiguous: 2025-05-07T20:32:55.5017096Z x0 = x0.contiguous() 2025-05-07T20:32:55.5017193Z x1 = x1.contiguous() 2025-05-07T20:32:55.5017278Z 2025-05-07T20:32:55.5017377Z if scale_ub is not None: 2025-05-07T20:32:55.5017495Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.5017646Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.5017729Z ) 2025-05-07T20:32:55.5017811Z else: 2025-05-07T20:32:55.5017923Z scale_ub_tensor = None 2025-05-07T20:32:55.5018002Z 2025-05-07T20:32:55.5018147Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.5018345Z op = silu_mul_quant 2025-05-07T20:32:55.5018439Z if compiled: 2025-05-07T20:32:55.5018553Z op = torch.compile(op) 2025-05-07T20:32:55.5018666Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5018748Z 2025-05-07T20:32:55.5018850Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.5018855Z 2025-05-07T20:32:55.5018959Z moe/activation_test.py:117: 2025-05-07T20:32:55.5019096Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5019208Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.5019314Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5019710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.5019810Z return fn(*args, **kwargs) 
2025-05-07T20:32:55.5020370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.5020485Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.5020861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.5021097Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.5021511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.5021613Z kernel = self.compile( 2025-05-07T20:32:55.5022020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.5022207Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.5022344Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5022350Z 2025-05-07T20:32:55.5022573Z self = 2025-05-07T20:32:55.5023390Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.5024261Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf752b1c0>} 2025-05-07T20:32:55.5025118Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.5025323Z context = 2025-05-07T20:32:55.5025333Z 2025-05-07T20:32:55.5025515Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.5025791Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.5025917Z module_map=module_map) 2025-05-07T20:32:55.5026089Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.5026196Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.5026288Z E ^ 2025-05-07T20:32:55.5026661Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.5026666Z 2025-05-07T20:32:55.5027104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.5027109Z 2025-05-07T20:32:55.5027222Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5027455Z self=, 2025-05-07T20:32:55.5027546Z T=4096, 2025-05-07T20:32:55.5027629Z D=7168, 2025-05-07T20:32:55.5027716Z scale_ub=None, 2025-05-07T20:32:55.5027814Z contiguous=False, 2025-05-07T20:32:55.5028149Z compiled=True, 2025-05-07T20:32:55.5028231Z ) 2025-05-07T20:32:55.5028468Z self = 2025-05-07T20:32:55.5028650Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:55.5028657Z 2025-05-07T20:32:55.5028743Z @given( 2025-05-07T20:32:55.5028870Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5028977Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5029104Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5029230Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5029351Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5029438Z ) 2025-05-07T20:32:55.5029698Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5029872Z def test_silu_mul_quant( 2025-05-07T20:32:55.5029955Z self, 2025-05-07T20:32:55.5030038Z T: int, 2025-05-07T20:32:55.5030132Z D: int, 2025-05-07T20:32:55.5030237Z scale_ub: Optional[float], 2025-05-07T20:32:55.5030333Z contiguous: bool, 2025-05-07T20:32:55.5030431Z compiled: bool, 2025-05-07T20:32:55.5030597Z ) -> None: 2025-05-07T20:32:55.5030699Z torch.manual_seed(2025) 2025-05-07T20:32:55.5030783Z 2025-05-07T20:32:55.5030961Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5031041Z 2025-05-07T20:32:55.5031151Z x_sign = torch.sign(x) 2025-05-07T20:32:55.5031284Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.5031380Z x = x_sign * x_clamp 2025-05-07T20:32:55.5031474Z x0 = x[:, :D] 2025-05-07T20:32:55.5031560Z x1 = x[:, D:] 2025-05-07T20:32:55.5031645Z 2025-05-07T20:32:55.5031739Z if contiguous: 2025-05-07T20:32:55.5031837Z x0 = x0.contiguous() 2025-05-07T20:32:55.5031940Z x1 = x1.contiguous() 2025-05-07T20:32:55.5032019Z 2025-05-07T20:32:55.5032119Z if scale_ub is not None: 2025-05-07T20:32:55.5032240Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.5032383Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.5032467Z ) 2025-05-07T20:32:55.5032557Z else: 2025-05-07T20:32:55.5032657Z scale_ub_tensor = None 2025-05-07T20:32:55.5032738Z 2025-05-07T20:32:55.5032883Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.5032981Z op = silu_mul_quant 2025-05-07T20:32:55.5033082Z if compiled: 2025-05-07T20:32:55.5033189Z op = torch.compile(op) 2025-05-07T20:32:55.5033303Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5033388Z 2025-05-07T20:32:55.5033489Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.5033493Z 2025-05-07T20:32:55.5033693Z moe/activation_test.py:117: 2025-05-07T20:32:55.5033842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5033950Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.5034055Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5034447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.5034550Z return fn(*args, **kwargs) 
2025-05-07T20:32:55.5035074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.5035177Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.5035556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.5035799Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.5036158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.5036352Z kernel = self.compile( 2025-05-07T20:32:55.5036755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.5036941Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.5037082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5037087Z 2025-05-07T20:32:55.5037302Z self = 2025-05-07T20:32:55.5038111Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.5038648Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf71a81f0>} 2025-05-07T20:32:55.5039525Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.5039734Z context = 2025-05-07T20:32:55.5039777Z 2025-05-07T20:32:55.5039953Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.5040236Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.5040351Z module_map=module_map) 2025-05-07T20:32:55.5040523Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.5040636Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.5040717Z E ^ 2025-05-07T20:32:55.5041093Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.5041098Z 2025-05-07T20:32:55.5041542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.5041546Z 2025-05-07T20:32:55.5041656Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5041898Z self=, 2025-05-07T20:32:55.5041980Z T=16384, 2025-05-07T20:32:55.5042062Z D=5120, 2025-05-07T20:32:55.5042156Z scale_ub=1200.0, 2025-05-07T20:32:55.5042249Z contiguous=False, 2025-05-07T20:32:55.5042338Z compiled=False, 2025-05-07T20:32:55.5042424Z ) 2025-05-07T20:32:55.5042653Z self = 2025-05-07T20:32:55.5042845Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:55.5042858Z 2025-05-07T20:32:55.5042940Z @given( 2025-05-07T20:32:55.5043070Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5043181Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5043308Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5043434Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5043561Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5043643Z ) 2025-05-07T20:32:55.5043905Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5044013Z def test_silu_mul_quant( 2025-05-07T20:32:55.5044094Z self, 2025-05-07T20:32:55.5044181Z T: int, 2025-05-07T20:32:55.5044262Z D: int, 2025-05-07T20:32:55.5044365Z scale_ub: Optional[float], 2025-05-07T20:32:55.5044464Z contiguous: bool, 2025-05-07T20:32:55.5044555Z compiled: bool, 2025-05-07T20:32:55.5044638Z ) -> None: 2025-05-07T20:32:55.5044750Z torch.manual_seed(2025) 2025-05-07T20:32:55.5044829Z 2025-05-07T20:32:55.5045006Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5045092Z 2025-05-07T20:32:55.5045308Z x_sign = torch.sign(x) 2025-05-07T20:32:55.5045442Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.5045543Z x = x_sign * x_clamp 2025-05-07T20:32:55.5045631Z x0 = x[:, :D] 2025-05-07T20:32:55.5045716Z x1 = x[:, D:] 2025-05-07T20:32:55.5045800Z 2025-05-07T20:32:55.5045889Z if contiguous: 2025-05-07T20:32:55.5045994Z x0 = x0.contiguous() 2025-05-07T20:32:55.5046089Z x1 = x1.contiguous() 2025-05-07T20:32:55.5046166Z 2025-05-07T20:32:55.5046270Z if scale_ub is not None: 2025-05-07T20:32:55.5046383Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.5046528Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.5046617Z ) 2025-05-07T20:32:55.5046743Z else: 2025-05-07T20:32:55.5046845Z scale_ub_tensor = None 2025-05-07T20:32:55.5046936Z 2025-05-07T20:32:55.5047082Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.5047178Z op = silu_mul_quant 2025-05-07T20:32:55.5047286Z if compiled: 2025-05-07T20:32:55.5047395Z op = torch.compile(op) 2025-05-07T20:32:55.5047557Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5047635Z 2025-05-07T20:32:55.5047733Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.5047737Z 2025-05-07T20:32:55.5047849Z moe/activation_test.py:117: 2025-05-07T20:32:55.5047986Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5048094Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.5048206Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5048728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:55.5048841Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.5049226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.5049461Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.5049828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.5049933Z kernel = self.compile( 2025-05-07T20:32:55.5050337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.5050529Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.5050664Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5050668Z 2025-05-07T20:32:55.5050891Z self = 2025-05-07T20:32:55.5051707Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.5052243Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf71a8700>} 2025-05-07T20:32:55.5053026Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.5053229Z context = 2025-05-07T20:32:55.5053235Z 2025-05-07T20:32:55.5053418Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.5053698Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.5053822Z module_map=module_map) 2025-05-07T20:32:55.5054074Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.5054181Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.5054270Z E ^ 2025-05-07T20:32:55.5054644Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.5054652Z 2025-05-07T20:32:55.5055086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.5055097Z 2025-05-07T20:32:55.5055209Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5055444Z self=, 2025-05-07T20:32:55.5055535Z T=16384, 2025-05-07T20:32:55.5055618Z D=5120, 2025-05-07T20:32:55.5055708Z scale_ub=1200.0, 2025-05-07T20:32:55.5055846Z contiguous=True, 2025-05-07T20:32:55.5055935Z compiled=True, 2025-05-07T20:32:55.5056014Z ) 2025-05-07T20:32:55.5056260Z self = 2025-05-07T20:32:55.5056444Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:55.5056449Z 2025-05-07T20:32:55.5056531Z @given( 2025-05-07T20:32:55.5056662Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5056821Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5056967Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5057115Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5057238Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5057324Z ) 2025-05-07T20:32:55.5057584Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5057685Z def test_silu_mul_quant( 2025-05-07T20:32:55.5057777Z self, 2025-05-07T20:32:55.5057859Z T: int, 2025-05-07T20:32:55.5057942Z D: int, 2025-05-07T20:32:55.5058054Z scale_ub: Optional[float], 2025-05-07T20:32:55.5058156Z contiguous: bool, 2025-05-07T20:32:55.5058253Z compiled: bool, 2025-05-07T20:32:55.5058337Z ) -> None: 2025-05-07T20:32:55.5058437Z torch.manual_seed(2025) 2025-05-07T20:32:55.5058523Z 2025-05-07T20:32:55.5058704Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5058784Z 2025-05-07T20:32:55.5058889Z x_sign = torch.sign(x) 2025-05-07T20:32:55.5059021Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.5059115Z x = x_sign * x_clamp 2025-05-07T20:32:55.5059209Z x0 = x[:, :D] 2025-05-07T20:32:55.5059294Z x1 = x[:, D:] 2025-05-07T20:32:55.5059372Z 2025-05-07T20:32:55.5059468Z if contiguous: 2025-05-07T20:32:55.5059566Z x0 = x0.contiguous() 2025-05-07T20:32:55.5059664Z x1 = x1.contiguous() 2025-05-07T20:32:55.5059750Z 2025-05-07T20:32:55.5059846Z if scale_ub is not None: 2025-05-07T20:32:55.5059967Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.5060111Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.5060192Z ) 2025-05-07T20:32:55.5060276Z else: 2025-05-07T20:32:55.5060375Z scale_ub_tensor = None 2025-05-07T20:32:55.5060456Z 2025-05-07T20:32:55.5060599Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.5060695Z op = silu_mul_quant 2025-05-07T20:32:55.5060785Z if compiled: 2025-05-07T20:32:55.5060900Z op = torch.compile(op) 2025-05-07T20:32:55.5061012Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5061091Z 2025-05-07T20:32:55.5061194Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.5061198Z 2025-05-07T20:32:55.5061302Z moe/activation_test.py:117: 2025-05-07T20:32:55.5061449Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5061558Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.5061748Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5062147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.5062246Z return fn(*args, **kwargs) 
2025-05-07T20:32:55.5062770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.5062879Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.5063256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.5063497Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.5063856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.5063998Z kernel = self.compile( 2025-05-07T20:32:55.5064412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.5064599Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.5064742Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5064786Z 2025-05-07T20:32:55.5065007Z self = 2025-05-07T20:32:55.5065815Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.5066353Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf71a97e0>} 2025-05-07T20:32:55.5067163Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.5067399Z context = 2025-05-07T20:32:55.5067403Z 2025-05-07T20:32:55.5067578Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.5067858Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.5067981Z module_map=module_map) 2025-05-07T20:32:55.5068154Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.5068265Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.5068352Z E ^ 2025-05-07T20:32:55.5068726Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.5068733Z 2025-05-07T20:32:55.5069171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.5069179Z 2025-05-07T20:32:55.5069289Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5069530Z self=, 2025-05-07T20:32:55.5069615Z T=16384, 2025-05-07T20:32:55.5069698Z D=5120, 2025-05-07T20:32:55.5069793Z scale_ub=None, 2025-05-07T20:32:55.5069886Z contiguous=False, 2025-05-07T20:32:55.5069976Z compiled=True, 2025-05-07T20:32:55.5070060Z ) 2025-05-07T20:32:55.5070290Z self = 2025-05-07T20:32:55.5070478Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:55.5070482Z 2025-05-07T20:32:55.5070573Z @given( 2025-05-07T20:32:55.5070699Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5070811Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5070941Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5071149Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5071279Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5071359Z ) 2025-05-07T20:32:55.5071621Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5071734Z def test_silu_mul_quant( 2025-05-07T20:32:55.5071814Z self, 2025-05-07T20:32:55.5071896Z T: int, 2025-05-07T20:32:55.5071986Z D: int, 2025-05-07T20:32:55.5072090Z scale_ub: Optional[float], 2025-05-07T20:32:55.5072185Z contiguous: bool, 2025-05-07T20:32:55.5072284Z compiled: bool, 2025-05-07T20:32:55.5072367Z ) -> None: 2025-05-07T20:32:55.5072467Z torch.manual_seed(2025) 2025-05-07T20:32:55.5072549Z 2025-05-07T20:32:55.5072726Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5072859Z 2025-05-07T20:32:55.5072957Z x_sign = torch.sign(x) 2025-05-07T20:32:55.5073097Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.5073197Z x = x_sign * x_clamp 2025-05-07T20:32:55.5073282Z x0 = x[:, :D] 2025-05-07T20:32:55.5073366Z x1 = x[:, D:] 2025-05-07T20:32:55.5073452Z 2025-05-07T20:32:55.5073685Z if contiguous: 2025-05-07T20:32:55.5073783Z x0 = x0.contiguous() 2025-05-07T20:32:55.5073884Z x1 = x1.contiguous() 2025-05-07T20:32:55.5073961Z 2025-05-07T20:32:55.5074057Z if scale_ub is not None: 2025-05-07T20:32:55.5074174Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.5074317Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.5074405Z ) 2025-05-07T20:32:55.5074487Z else: 2025-05-07T20:32:55.5074587Z scale_ub_tensor = None 2025-05-07T20:32:55.5074674Z 2025-05-07T20:32:55.5074811Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.5074908Z op = silu_mul_quant 2025-05-07T20:32:55.5075010Z if compiled: 2025-05-07T20:32:55.5075116Z op = torch.compile(op) 2025-05-07T20:32:55.5075227Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5075311Z 2025-05-07T20:32:55.5075405Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.5075413Z 2025-05-07T20:32:55.5075516Z moe/activation_test.py:117: 2025-05-07T20:32:55.5075658Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5075765Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.5075875Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5076261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.5076360Z return fn(*args, **kwargs) 
2025-05-07T20:32:55.5076889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.5076995Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.5077375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.5077615Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.5077977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.5078081Z kernel = self.compile( 2025-05-07T20:32:55.5078482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.5078668Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.5078807Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5078814Z 2025-05-07T20:32:55.5079031Z self = 2025-05-07T20:32:55.5079974Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.5080508Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf71aa680>} 2025-05-07T20:32:55.5081294Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.5081500Z context = 2025-05-07T20:32:55.5081505Z 2025-05-07T20:32:55.5081680Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.5082006Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.5082127Z module_map=module_map) 2025-05-07T20:32:55.5082299Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.5082412Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.5082494Z E ^ 2025-05-07T20:32:55.5082918Z E ValueError("type fp8e4nv not supported in this architecture. 
E       The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis then tried ten more examples, and every one raised the identical CompilationError while JIT-compiling _fbgemm_silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80). The failure also reproduces with compiled=False (those tracebacks simply lack the torch/_dynamo/eval_frame.py frame), so torch.compile is not a factor:

Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)
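Every compilation failure above is the same architecture gate: the error's own list of supported dtypes ('fp8e4b15', 'fp8e5') is what Triton offers on pre-Ada NVIDIA GPUs, while fp8e4nv (FP8 E4M3) generally requires compute capability 8.9 or newer; the 22.07 GiB device reported later in this log is consistent with an SM 8.6 part such as the A10G. Below is a minimal sketch of a capability guard that would skip these examples on such GPUs; the helper name, example class, and skip message are illustrative assumptions, not code from the FBGEMM suite.

import unittest

import torch


def cuda_supports_fp8_e4m3() -> bool:
    """True when the active GPU can compile Triton's fp8e4nv dtype (SM 8.9+)."""
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)


class _Fp8GuardExample(unittest.TestCase):
    # Hypothetical placement: the real suite would decorate test_silu_mul_quant.
    @unittest.skipUnless(
        cuda_supports_fp8_e4m3(),
        "Triton fp8e4nv needs SM 8.9+; this GPU only offers fp8e4b15/fp8e5",
    )
    def test_requires_fp8(self) -> None:
        ...

With a guard like this the job would report the cases as skipped on SM 8.x runners instead of failing the whole Hypothesis run.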
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.5229375Z 2025-05-07T20:32:55.5229807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.5229812Z 2025-05-07T20:32:55.5229927Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5230161Z self=, 2025-05-07T20:32:55.5230245Z T=16384, 2025-05-07T20:32:55.5230332Z D=5120, 2025-05-07T20:32:55.5230422Z scale_ub=None, 2025-05-07T20:32:55.5230524Z contiguous=False, 2025-05-07T20:32:55.5230614Z compiled=False, 2025-05-07T20:32:55.5230694Z ) 2025-05-07T20:32:55.5230932Z self = 2025-05-07T20:32:55.5231119Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:55.5231124Z 2025-05-07T20:32:55.5231206Z @given( 2025-05-07T20:32:55.5231340Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5231448Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5231570Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5231702Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5231823Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5231910Z ) 2025-05-07T20:32:55.5232172Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5232272Z def test_silu_mul_quant( 2025-05-07T20:32:55.5232363Z self, 2025-05-07T20:32:55.5232446Z T: int, 2025-05-07T20:32:55.5232528Z D: int, 2025-05-07T20:32:55.5232643Z scale_ub: Optional[float], 2025-05-07T20:32:55.5232866Z contiguous: bool, 2025-05-07T20:32:55.5232961Z compiled: bool, 2025-05-07T20:32:55.5233053Z ) -> None: 2025-05-07T20:32:55.5233153Z torch.manual_seed(2025) 2025-05-07T20:32:55.5233234Z 2025-05-07T20:32:55.5233417Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5233570Z 2025-05-07T20:32:55.5233676Z x_sign = torch.sign(x) 2025-05-07T20:32:55.5233810Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.5235707Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
2025-05-07T20:32:55.5236050Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:55.5239697Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:55.5241558Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5241693Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:55.5241703Z
2025-05-07T20:32:55.5241896Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:55.5248302Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5250221Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5250360Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5250364Z
2025-05-07T20:32:55.5250478Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:55.5259703Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:55.5261586Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5261764Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:55.5261768Z
2025-05-07T20:32:55.5261926Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:55.5265558Z >       x_sign = torch.sign(x)
2025-05-07T20:32:55.5267484Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5267619Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:32:55.5267625Z
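Note: reading these failures in sequence, each example dies one statement earlier than the last (x_clamp, then x_sign, then the initial torch.randn) while the reported free memory shrinks from 140 MiB toward 26 MiB, which suggests memory from earlier examples is still reachable between runs, for instance via exception tracebacks that Hypothesis retains. With unittest-style tests, setUp/tearDown run once around all of a @given test's examples, not once per example, so per-example cleanup has to live in the test body itself. A sketch, assuming cleanup at the top of each example is acceptable here:

    import gc
    import torch

    def _reclaim_cuda_memory() -> None:
        # Drop unreachable references (including those held by stored
        # tracebacks), then return cached-but-unused blocks to the driver.
        # Live tensors are unaffected.
        gc.collect()
        torch.cuda.synchronize()
        torch.cuda.empty_cache()

    # Hypothetical usage: call as the first statement inside
    # test_silu_mul_quant's body, before the torch.randn allocation.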
2025-05-07T20:32:55.5267741Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:55.5269055Z     @given(
2025-05-07T20:32:55.5269184Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:55.5269291Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:55.5269419Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:55.5269543Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:55.5269706Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:55.5269792Z     )
2025-05-07T20:32:55.5270056Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:55.5270156Z     def test_silu_mul_quant(
2025-05-07T20:32:55.5270245Z         self,
2025-05-07T20:32:55.5270327Z         T: int,
2025-05-07T20:32:55.5270408Z         D: int,
2025-05-07T20:32:55.5270559Z         scale_ub: Optional[float],
2025-05-07T20:32:55.5270714Z         contiguous: bool,
2025-05-07T20:32:55.5270807Z         compiled: bool,
2025-05-07T20:32:55.5270898Z     ) -> None:
2025-05-07T20:32:55.5270998Z         torch.manual_seed(2025)
2025-05-07T20:32:55.5271082Z
2025-05-07T20:32:55.5271259Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5271338Z
2025-05-07T20:32:55.5271443Z         x_sign = torch.sign(x)
2025-05-07T20:32:55.5271575Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:55.5271674Z         x = x_sign * x_clamp
2025-05-07T20:32:55.5271765Z         x0 = x[:, :D]
2025-05-07T20:32:55.5271851Z         x1 = x[:, D:]
2025-05-07T20:32:55.5271929Z
2025-05-07T20:32:55.5272027Z         if contiguous:
2025-05-07T20:32:55.5272127Z             x0 = x0.contiguous()
2025-05-07T20:32:55.5272222Z             x1 = x1.contiguous()
2025-05-07T20:32:55.5272306Z
2025-05-07T20:32:55.5272405Z         if scale_ub is not None:
2025-05-07T20:32:55.5272527Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:55.5272670Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:55.5272752Z             )
2025-05-07T20:32:55.5272843Z         else:
2025-05-07T20:32:55.5272943Z             scale_ub_tensor = None
2025-05-07T20:32:55.5273022Z
2025-05-07T20:32:55.5273167Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:55.5273264Z             op = silu_mul_quant
2025-05-07T20:32:55.5273355Z             if compiled:
2025-05-07T20:32:55.5273470Z                 op = torch.compile(op)
2025-05-07T20:32:55.5273773Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:55.5273851Z
2025-05-07T20:32:55.5273959Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:55.5273963Z
2025-05-07T20:32:55.5274068Z moe/activation_test.py:117:
2025-05-07T20:32:55.5274213Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:55.5274325Z moe/activation_test.py:115: in fn
2025-05-07T20:32:55.5274436Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:55.5274975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:55.5275079Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:55.5280476Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:55.5280593Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:55.5280678Z E       ^
2025-05-07T20:32:55.5281059Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:55.5281063Z
2025-05-07T20:32:55.5281498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:55.5281503Z
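Note: the listing above is the complete test body, so the contract of the op under test is visible: silu_mul_quant(x0, x1, scale_ub) consumes two [T, D] bf16 halves of one [T, 2*D] input and returns an fp8 tensor plus a scale. As a point of reference only, a plain-PyTorch sketch of plausible semantics; the real fbgemm kernel's scaling granularity and clamping behavior are assumptions here, not taken from this log:

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_reference(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # silu(x0) * x1, then quantize to FP8 E4M3 with a row-wise scale.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        amax = y.abs().amax(dim=-1, keepdim=True)
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        scale = amax.clamp(min=1e-12) / fp8_max
        y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, scale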
2025-05-07T20:32:55.5281614Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:55.5287644Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:55.5293936Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:55.5294044Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:55.5294134Z E       ^
2025-05-07T20:32:55.5294508Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:55.5294951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:55.5294962Z
2025-05-07T20:32:55.5295072Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:55.5301010Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:55.5307436Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:55.5307549Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:55.5307630Z E       ^
2025-05-07T20:32:55.5308007Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:55.5308443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:55.5308450Z
2025-05-07T20:32:55.5308560Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:55.5311970Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5313916Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5314184Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5314189Z
2025-05-07T20:32:55.5314301Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:55.5320249Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:55.5327153Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:55.5327261Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:55.5327351Z E       ^
2025-05-07T20:32:55.5327723Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:55.5328157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:55.5328161Z
2025-05-07T20:32:55.5328276Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:55.5332135Z >       x_sign = torch.sign(x)
2025-05-07T20:32:55.5333977Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5334121Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:32:55.5334128Z
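Note: the "Tried to allocate" sizes are exactly one [T, 2*D] bf16 tensor (2 bytes per element). Every tensor in the failing region (x, abs(x), sign, clamp) has that same shape, so the requested size identifies the example's shape rather than the statement. A quick check that reproduces the sizes seen in this log:

    # 2 bytes per bf16 element; prints 40, 56, 80, 112, 320, 448 MiB,
    # matching the OutOfMemoryError messages above.
    for T, D in [(2048, 5120), (2048, 7168), (4096, 5120),
                 (4096, 7168), (16384, 5120), (16384, 7168)]:
        print(f"T={T:5d} D={D}: {T * 2 * D * 2 / 2**20:6.2f} MiB")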
2025-05-07T20:32:55.5334236Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:55.5337672Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5339519Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5339691Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5339696Z
2025-05-07T20:32:55.5339809Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:55.5343209Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5345033Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5345177Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5345181Z
2025-05-07T20:32:55.5345288Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:55.5348756Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5350682Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5350815Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5350820Z
2025-05-07T20:32:55.5350937Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:55.5354380Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5356303Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5356448Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5356453Z
2025-05-07T20:32:55.5356561Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:55.5360043Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5361896Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5362028Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5362035Z
2025-05-07T20:32:55.5362149Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:55.5365517Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5367449Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5367623Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5367627Z
2025-05-07T20:32:55.5367735Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:55.5371079Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5372931Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5373109Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5373113Z
2025-05-07T20:32:55.5373230Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:55.5376691Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5378589Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5378727Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5378731Z
2025-05-07T20:32:55.5378839Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:55.5382734Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5384587Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5384766Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5384771Z
2025-05-07T20:32:55.5384885Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:55.5394023Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.5395912Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:55.5396048Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:55.5396052Z
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.5395921Z 2025-05-07T20:32:55.5396048Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:55.5396052Z 2025-05-07T20:32:55.5396167Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5396405Z self=, 2025-05-07T20:32:55.5396488Z T=128, 2025-05-07T20:32:55.5396575Z D=5120, 2025-05-07T20:32:55.5396748Z scale_ub=1200.0, 2025-05-07T20:32:55.5396844Z contiguous=False, 2025-05-07T20:32:55.5396940Z compiled=False, 2025-05-07T20:32:55.5397020Z ) 2025-05-07T20:32:55.5397251Z self = 2025-05-07T20:32:55.5397443Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:55.5397448Z 2025-05-07T20:32:55.5397532Z @given( 2025-05-07T20:32:55.5397665Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5397772Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5397895Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5398027Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5398149Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5398276Z ) 2025-05-07T20:32:55.5398546Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5398649Z def test_silu_mul_quant( 2025-05-07T20:32:55.5398746Z self, 2025-05-07T20:32:55.5398831Z T: int, 2025-05-07T20:32:55.5398916Z D: int, 2025-05-07T20:32:55.5399031Z scale_ub: Optional[float], 2025-05-07T20:32:55.5399170Z contiguous: bool, 2025-05-07T20:32:55.5399262Z compiled: bool, 2025-05-07T20:32:55.5399400Z ) -> None: 2025-05-07T20:32:55.5399503Z torch.manual_seed(2025) 2025-05-07T20:32:55.5399581Z 2025-05-07T20:32:55.5399766Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5399851Z 2025-05-07T20:32:55.5399949Z x_sign = torch.sign(x) 2025-05-07T20:32:55.5400095Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.5400190Z x = x_sign * x_clamp 2025-05-07T20:32:55.5400280Z x0 = x[:, :D] 2025-05-07T20:32:55.5400377Z x1 = x[:, D:] 2025-05-07T20:32:55.5400456Z 2025-05-07T20:32:55.5400552Z if contiguous: 2025-05-07T20:32:55.5400654Z x0 = x0.contiguous() 2025-05-07T20:32:55.5400755Z x1 = x1.contiguous() 2025-05-07T20:32:55.5400839Z 2025-05-07T20:32:55.5400937Z if scale_ub is not None: 2025-05-07T20:32:55.5401051Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.5401209Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.5401292Z ) 2025-05-07T20:32:55.5401374Z else: 2025-05-07T20:32:55.5401480Z scale_ub_tensor = None 2025-05-07T20:32:55.5401559Z 2025-05-07T20:32:55.5401697Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.5401799Z op = silu_mul_quant 2025-05-07T20:32:55.5401890Z if compiled: 2025-05-07T20:32:55.5402002Z op = torch.compile(op) 2025-05-07T20:32:55.5402118Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5402196Z 2025-05-07T20:32:55.5402301Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.5402305Z 2025-05-07T20:32:55.5402411Z moe/activation_test.py:117: 2025-05-07T20:32:55.5402550Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5402667Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.5402777Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5403314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.5403427Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.5403808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.5404051Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.5404411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.5404514Z kernel = self.compile( 2025-05-07T20:32:55.5404978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.5405168Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.5405312Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5405319Z 2025-05-07T20:32:55.5405540Z self = 2025-05-07T20:32:55.5406359Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.5406899Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf6948940>} 2025-05-07T20:32:55.5407725Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.5407936Z context = 2025-05-07T20:32:55.5407983Z 2025-05-07T20:32:55.5408227Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.5408507Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.5408629Z module_map=module_map) 2025-05-07T20:32:55.5408803Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.5408913Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.5408996Z E ^ 2025-05-07T20:32:55.5409370Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.5409378Z 2025-05-07T20:32:55.5409821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.5409826Z 2025-05-07T20:32:55.5409938Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5410178Z self=, 2025-05-07T20:32:55.5410265Z T=2048, 2025-05-07T20:32:55.5410347Z D=7168, 2025-05-07T20:32:55.5410446Z scale_ub=None, 2025-05-07T20:32:55.5410538Z contiguous=False, 2025-05-07T20:32:55.5410628Z compiled=False, 2025-05-07T20:32:55.5410713Z ) 2025-05-07T20:32:55.5410944Z self = 2025-05-07T20:32:55.5411128Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:55.5411132Z 2025-05-07T20:32:55.5411219Z @given( 2025-05-07T20:32:55.5411346Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5411461Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5411584Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5411711Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5411839Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5411920Z ) 2025-05-07T20:32:55.5412180Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5412297Z def test_silu_mul_quant( 2025-05-07T20:32:55.5412379Z self, 2025-05-07T20:32:55.5412461Z T: int, 2025-05-07T20:32:55.5412549Z D: int, 2025-05-07T20:32:55.5412652Z scale_ub: Optional[float], 2025-05-07T20:32:55.5412747Z contiguous: bool, 2025-05-07T20:32:55.5412845Z compiled: bool, 2025-05-07T20:32:55.5412928Z ) -> None: 2025-05-07T20:32:55.5413034Z torch.manual_seed(2025) 2025-05-07T20:32:55.5413114Z 2025-05-07T20:32:55.5413296Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5415204Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.5415212Z 2025-05-07T20:32:55.5415344Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:55.5415348Z 2025-05-07T20:32:55.5415469Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5415703Z self=, 2025-05-07T20:32:55.5415827Z T=128, 2025-05-07T20:32:55.5415919Z D=7168, 2025-05-07T20:32:55.5416009Z scale_ub=1200.0, 2025-05-07T20:32:55.5416099Z contiguous=True, 2025-05-07T20:32:55.5416193Z compiled=True, 2025-05-07T20:32:55.5416274Z ) 2025-05-07T20:32:55.5416507Z self = 2025-05-07T20:32:55.5416688Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:55.5416734Z 2025-05-07T20:32:55.5416820Z @given( 2025-05-07T20:32:55.5416990Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5417099Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5417223Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5417354Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5417476Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5417558Z ) 2025-05-07T20:32:55.5417828Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5417933Z def test_silu_mul_quant( 2025-05-07T20:32:55.5418023Z self, 2025-05-07T20:32:55.5418107Z T: int, 2025-05-07T20:32:55.5418192Z D: int, 2025-05-07T20:32:55.5418304Z scale_ub: Optional[float], 2025-05-07T20:32:55.5418403Z contiguous: bool, 2025-05-07T20:32:55.5418496Z compiled: bool, 2025-05-07T20:32:55.5418588Z ) -> None: 2025-05-07T20:32:55.5418693Z torch.manual_seed(2025) 2025-05-07T20:32:55.5418773Z 2025-05-07T20:32:55.5418959Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5419039Z 2025-05-07T20:32:55.5419136Z x_sign = torch.sign(x) 2025-05-07T20:32:55.5419277Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.5419373Z x = x_sign * x_clamp 2025-05-07T20:32:55.5419464Z x0 = x[:, :D] 2025-05-07T20:32:55.5419550Z x1 = x[:, D:] 2025-05-07T20:32:55.5419627Z 2025-05-07T20:32:55.5419727Z if contiguous: 2025-05-07T20:32:55.5419825Z x0 = x0.contiguous() 2025-05-07T20:32:55.5419919Z x1 = x1.contiguous() 2025-05-07T20:32:55.5420004Z 2025-05-07T20:32:55.5420104Z if scale_ub is not None: 2025-05-07T20:32:55.5420222Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.5420374Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.5420457Z ) 2025-05-07T20:32:55.5420539Z else: 2025-05-07T20:32:55.5420649Z scale_ub_tensor = None 2025-05-07T20:32:55.5420729Z 2025-05-07T20:32:55.5420867Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.5420975Z op = silu_mul_quant 2025-05-07T20:32:55.5421067Z if compiled: 2025-05-07T20:32:55.5421178Z op = torch.compile(op) 2025-05-07T20:32:55.5421293Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5421371Z 2025-05-07T20:32:55.5421473Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.5421480Z 2025-05-07T20:32:55.5421584Z moe/activation_test.py:117: 2025-05-07T20:32:55.5421721Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5421883Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.5421992Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.5422388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.5422492Z return fn(*args, **kwargs) 
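Every OutOfMemoryError above ends with the same allocator hint. A minimal sketch of applying it, with a hypothetical helper for releasing cached blocks between Hypothesis examples; only the PYTORCH_CUDA_ALLOC_CONF value comes from the messages above, the rest is illustrative:

    # Sketch only: the env var must be set before the first CUDA allocation
    # for the caching allocator to honor it.
    import gc
    import os

    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after the env var so the allocator picks it up

    def free_cached_blocks() -> None:
        # Hypothetical helper: drop dead Python references, then return
        # cached CUDA blocks to the driver between test examples.
        gc.collect()
        torch.cuda.empty_cache()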
2025-05-07T20:32:55.5423010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.5423120Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.5423498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.5423739Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.5425188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.5425325Z kernel = self.compile( 2025-05-07T20:32:55.5425871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.5426115Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.5426581Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.5426587Z 2025-05-07T20:32:55.5426821Z self = 2025-05-07T20:32:55.5427635Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.5428179Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdf6948dc0>} 2025-05-07T20:32:55.5428957Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.5429160Z context = 2025-05-07T20:32:55.5429176Z 2025-05-07T20:32:55.5429354Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.5429630Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.5429750Z module_map=module_map) 2025-05-07T20:32:55.5429923Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.5430028Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.5430116Z E ^ 2025-05-07T20:32:55.5430494Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.5430499Z 2025-05-07T20:32:55.5430943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.5430947Z 2025-05-07T20:32:55.5431058Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5431295Z self=, 2025-05-07T20:32:55.5431385Z T=128, 2025-05-07T20:32:55.5431468Z D=7168, 2025-05-07T20:32:55.5431558Z scale_ub=1200.0, 2025-05-07T20:32:55.5431654Z contiguous=True, 2025-05-07T20:32:55.5431744Z compiled=False, 2025-05-07T20:32:55.5431827Z ) 2025-05-07T20:32:55.5432061Z self = 2025-05-07T20:32:55.5432239Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:55.5432243Z 2025-05-07T20:32:55.5432337Z @given( 2025-05-07T20:32:55.5432463Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5432571Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5432784Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5432914Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5433036Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5433126Z ) 2025-05-07T20:32:55.5433387Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5433488Z def test_silu_mul_quant( 2025-05-07T20:32:55.5433734Z self, 2025-05-07T20:32:55.5433818Z T: int, 2025-05-07T20:32:55.5433906Z D: int, 2025-05-07T20:32:55.5434009Z scale_ub: Optional[float], 2025-05-07T20:32:55.5434104Z contiguous: bool, 2025-05-07T20:32:55.5434202Z compiled: bool, 2025-05-07T20:32:55.5434287Z ) -> None: 2025-05-07T20:32:55.5434387Z torch.manual_seed(2025) 2025-05-07T20:32:55.5434548Z 2025-05-07T20:32:55.5434727Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5434804Z 2025-05-07T20:32:55.5434913Z x_sign = torch.sign(x) 2025-05-07T20:32:55.5435044Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.5436996Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.5437044Z 2025-05-07T20:32:55.5437175Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:55.5437182Z 2025-05-07T20:32:55.5437296Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5437530Z self=, 2025-05-07T20:32:55.5437612Z T=128, 2025-05-07T20:32:55.5437703Z D=5120, 2025-05-07T20:32:55.5437791Z scale_ub=1200.0, 2025-05-07T20:32:55.5438115Z contiguous=True, 2025-05-07T20:32:55.5438212Z compiled=True, 2025-05-07T20:32:55.5438295Z ) 2025-05-07T20:32:55.5438540Z self = 2025-05-07T20:32:55.5438729Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:55.5438734Z 2025-05-07T20:32:55.5438817Z @given( 2025-05-07T20:32:55.5438943Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5439088Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5439211Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5439341Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5439464Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5439544Z ) 2025-05-07T20:32:55.5439811Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5439912Z def test_silu_mul_quant( 2025-05-07T20:32:55.5439995Z self, 2025-05-07T20:32:55.5440082Z T: int, 2025-05-07T20:32:55.5440165Z D: int, 2025-05-07T20:32:55.5440279Z scale_ub: Optional[float], 2025-05-07T20:32:55.5440376Z contiguous: bool, 2025-05-07T20:32:55.5440470Z compiled: bool, 2025-05-07T20:32:55.5440562Z ) -> None: 2025-05-07T20:32:55.5440664Z torch.manual_seed(2025) 2025-05-07T20:32:55.5440758Z 2025-05-07T20:32:55.5440934Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5441013Z 2025-05-07T20:32:55.5441116Z x_sign = torch.sign(x) 2025-05-07T20:32:55.5441247Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.5443152Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.5443161Z 2025-05-07T20:32:55.5443287Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:55.5443291Z 2025-05-07T20:32:55.5443407Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.5443645Z self=, 2025-05-07T20:32:55.5443730Z T=128, 2025-05-07T20:32:55.5443818Z D=7168, 2025-05-07T20:32:55.5443954Z scale_ub=None, 2025-05-07T20:32:55.5444047Z contiguous=True, 2025-05-07T20:32:55.5444145Z compiled=True, 2025-05-07T20:32:55.5444225Z ) 2025-05-07T20:32:55.5444454Z self = 2025-05-07T20:32:55.5444636Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:55.5444640Z 2025-05-07T20:32:55.5444763Z @given( 2025-05-07T20:32:55.5444931Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.5445039Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.5445161Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.5445292Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.5445413Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.5445493Z ) 2025-05-07T20:32:55.5445758Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.5445863Z def test_silu_mul_quant( 2025-05-07T20:32:55.5445944Z self, 2025-05-07T20:32:55.5446033Z T: int, 2025-05-07T20:32:55.5446114Z D: int, 2025-05-07T20:32:55.5446226Z scale_ub: Optional[float], 2025-05-07T20:32:55.5446322Z contiguous: bool, 2025-05-07T20:32:55.5446415Z compiled: bool, 2025-05-07T20:32:55.5446504Z ) -> None: 2025-05-07T20:32:55.5446603Z torch.manual_seed(2025) 2025-05-07T20:32:55.5446688Z 2025-05-07T20:32:55.5446873Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.5448705Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:55.5448713Z 2025-05-07T20:32:55.5448849Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:55.5448991Z =============================== warnings summary =============================== 2025-05-07T20:32:55.5449313Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:55.5449641Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:55.5449952Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:55.5450869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:55.5451114Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:55.5451119Z 2025-05-07T20:32:55.5451391Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:55.5451571Z ================= 1 failed, 1 deselected, 3 warnings in 22.74s ================= 2025-05-07T20:32:57.2134191Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:57.2771613Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:57.2771869Z 2025-05-07T20:32:59.2790637Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:33:01.5904003Z ============================= test session starts ============================== 2025-05-07T20:33:01.5905158Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:33:01.5905746Z cachedir: .pytest_cache 2025-05-07T20:33:01.5906392Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:33:01.5907396Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:33:01.5907851Z plugins: hypothesis-6.131.14 2025-05-07T20:33:03.2442745Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:33:03.4268285Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:33:03.4268899Z run-last-failure: rerun previous 1 failure 2025-05-07T20:33:03.4269205Z 2025-05-07T20:33:06.0110178Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.0111054Z self=, 2025-05-07T20:33:06.0111536Z T=1, 2025-05-07T20:33:06.0111759Z D=5120, 2025-05-07T20:33:06.0111987Z scale_ub=None, 2025-05-07T20:33:06.0112242Z contiguous=True, 2025-05-07T20:33:06.0112507Z compiled=True, 2025-05-07T20:33:06.0112748Z ) 2025-05-07T20:33:06.0113123Z self = 2025-05-07T20:33:06.0113818Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:06.0114118Z 2025-05-07T20:33:06.0114219Z @given( 2025-05-07T20:33:06.0114490Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.0114855Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.0115211Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.0115590Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.0115979Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.0116316Z ) 2025-05-07T20:33:06.0116723Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.0117286Z def test_silu_mul_quant( 2025-05-07T20:33:06.0117690Z self, 2025-05-07T20:33:06.0117939Z T: int, 2025-05-07T20:33:06.0118167Z D: int, 2025-05-07T20:33:06.0118424Z scale_ub: Optional[float], 2025-05-07T20:33:06.0118743Z contiguous: bool, 2025-05-07T20:33:06.0119020Z compiled: bool, 2025-05-07T20:33:06.0119287Z ) -> None: 2025-05-07T20:33:06.0119543Z torch.manual_seed(2025) 2025-05-07T20:33:06.0119825Z 2025-05-07T20:33:06.0120219Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.0120639Z 2025-05-07T20:33:06.0120864Z x_sign = torch.sign(x) 2025-05-07T20:33:06.0121203Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:33:06.0121566Z x = x_sign * x_clamp 2025-05-07T20:33:06.0121843Z x0 = x[:, :D] 2025-05-07T20:33:06.0122104Z x1 = x[:, D:] 2025-05-07T20:33:06.0122360Z 2025-05-07T20:33:06.0122576Z if contiguous: 2025-05-07T20:33:06.0123232Z x0 = x0.contiguous() 2025-05-07T20:33:06.0123545Z x1 = x1.contiguous() 2025-05-07T20:33:06.0124168Z 2025-05-07T20:33:06.0124406Z if scale_ub is not None: 2025-05-07T20:33:06.0124732Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.0125129Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.0125483Z ) 2025-05-07T20:33:06.0125712Z else: 2025-05-07T20:33:06.0125958Z scale_ub_tensor = None 2025-05-07T20:33:06.0126245Z 2025-05-07T20:33:06.0126517Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.0126887Z op = silu_mul_quant 2025-05-07T20:33:06.0127173Z if compiled: 2025-05-07T20:33:06.0127463Z op = torch.compile(op) 2025-05-07T20:33:06.0127807Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.0128226Z 2025-05-07T20:33:06.0128455Z y_fp8, y_scale = fn() 2025-05-07T20:33:06.0128790Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:06.0129120Z 2025-05-07T20:33:06.0129410Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.0129837Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:06.0130275Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:06.0130710Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:06.0131132Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:06.0131492Z 2025-05-07T20:33:06.0131722Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:06.0131951Z 2025-05-07T20:33:06.0132069Z moe/activation_test.py:126: 2025-05-07T20:33:06.0132413Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.0132797Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:06.0133181Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:06.0134088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:06.0134949Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:06.0135571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:06.0136361Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:06.0137147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:06.0137974Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:06.0138830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:06.0139694Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:06.0140535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:06.0141263Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:06.0141954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:06.0142556Z fn() 2025-05-07T20:33:06.0143148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:06.0143811Z self.fn.run( 
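The CompilationError repeated in these traces ("type fp8e4nv not supported in this architecture") means the runner's GPU exposes only ('fp8e4b15', 'fp8e5') to Triton. A minimal sketch of a capability gate for such tests; supports_fp8e4nv is a hypothetical helper, and the >= (8, 9) threshold is an assumption (Ada/Hopper-class hardware):

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: fp8e4nv requires compute capability >= 8.9; the GPU
        # in this job reports a lower SM, hence the ValueError above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Usage sketch: @unittest.skipUnless(supports_fp8e4nv(), "needs fp8e4nv")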
2025-05-07T20:33:06.0144355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:06.0144964Z kernel = self.compile( 2025-05-07T20:33:06.0145579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:06.0146334Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:06.0146883Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.0147147Z 2025-05-07T20:33:06.0147392Z self = 2025-05-07T20:33:06.0148620Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:06.0150215Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdb7bb8af0>} 2025-05-07T20:33:06.0151749Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:06.0152972Z context = 2025-05-07T20:33:06.0153303Z 2025-05-07T20:33:06.0153615Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:06.0154212Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:06.0154821Z module_map=module_map) 2025-05-07T20:33:06.0155287Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.0155693Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:06.0156002Z E ^ 2025-05-07T20:33:06.0156536Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.0157048Z 2025-05-07T20:33:06.0157528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.0158111Z 2025-05-07T20:33:06.0158231Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.0158704Z self=, 2025-05-07T20:33:06.0159169Z T=2048, 2025-05-07T20:33:06.0159385Z D=5120, 2025-05-07T20:33:06.0159608Z scale_ub=1200.0, 2025-05-07T20:33:06.0159869Z contiguous=True, 2025-05-07T20:33:06.0160120Z compiled=False, 2025-05-07T20:33:06.0160369Z ) 2025-05-07T20:33:07.5325390Z self = 2025-05-07T20:33:07.5326539Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.5327098Z 2025-05-07T20:33:07.5327271Z @given( 2025-05-07T20:33:07.5327740Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5328385Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5329009Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5329725Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5330357Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5330774Z ) 2025-05-07T20:33:07.5331272Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5331781Z def test_silu_mul_quant( 2025-05-07T20:33:07.5332069Z self, 2025-05-07T20:33:07.5332301Z T: int, 2025-05-07T20:33:07.5332533Z D: int, 2025-05-07T20:33:07.5332791Z scale_ub: Optional[float], 2025-05-07T20:33:07.5333111Z contiguous: bool, 2025-05-07T20:33:07.5333385Z compiled: bool, 2025-05-07T20:33:07.5333651Z ) -> None: 2025-05-07T20:33:07.5333909Z torch.manual_seed(2025) 2025-05-07T20:33:07.5334185Z 2025-05-07T20:33:07.5334501Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5334898Z 
2025-05-07T20:33:07.5335120Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5335458Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5335820Z x = x_sign * x_clamp 2025-05-07T20:33:07.5336105Z x0 = x[:, :D] 2025-05-07T20:33:07.5336354Z x1 = x[:, D:] 2025-05-07T20:33:07.5336896Z 2025-05-07T20:33:07.5337122Z if contiguous: 2025-05-07T20:33:07.5337387Z x0 = x0.contiguous() 2025-05-07T20:33:07.5337685Z x1 = x1.contiguous() 2025-05-07T20:33:07.5337965Z 2025-05-07T20:33:07.5338191Z if scale_ub is not None: 2025-05-07T20:33:07.5338515Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5338904Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5339256Z ) 2025-05-07T20:33:07.5339483Z else: 2025-05-07T20:33:07.5339732Z scale_ub_tensor = None 2025-05-07T20:33:07.5340047Z 2025-05-07T20:33:07.5340357Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5340744Z op = silu_mul_quant 2025-05-07T20:33:07.5341028Z if compiled: 2025-05-07T20:33:07.5341425Z op = torch.compile(op) 2025-05-07T20:33:07.5341770Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5342086Z 2025-05-07T20:33:07.5349638Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5349858Z 2025-05-07T20:33:07.5350006Z moe/activation_test.py:117: 2025-05-07T20:33:07.5350376Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5350906Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5351321Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5352122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5352921Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5353638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5354428Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5355190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5355812Z kernel = self.compile( 2025-05-07T20:33:07.5356442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5357201Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5357665Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5357939Z 2025-05-07T20:33:07.5358179Z self = 2025-05-07T20:33:07.5359414Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5361043Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efdb7a95990>} 2025-05-07T20:33:07.5362582Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5363755Z context = 2025-05-07T20:33:07.5364090Z 2025-05-07T20:33:07.5364292Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5364892Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5365426Z module_map=module_map) 2025-05-07T20:33:07.5365851Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5366258Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5366558Z E ^ 2025-05-07T20:33:07.5367093Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5367606Z 2025-05-07T20:33:07.5368144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5368730Z 2025-05-07T20:33:07.5368863Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5369341Z self=, 2025-05-07T20:33:07.5369805Z T=2048, 2025-05-07T20:33:07.5370025Z D=5120, 2025-05-07T20:33:07.5370246Z scale_ub=1200.0, 2025-05-07T20:33:07.5370509Z contiguous=True, 2025-05-07T20:33:07.5370770Z compiled=True, 2025-05-07T20:33:07.5371007Z ) 2025-05-07T20:33:07.5371377Z self = 2025-05-07T20:33:07.5371949Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.5372307Z 2025-05-07T20:33:07.5372408Z @given( 2025-05-07T20:33:07.5372671Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5373035Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5373393Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5373773Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5374158Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5374542Z ) 2025-05-07T20:33:07.5374986Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5375502Z def test_silu_mul_quant( 2025-05-07T20:33:07.5375792Z self, 2025-05-07T20:33:07.5376017Z T: int, 2025-05-07T20:33:07.5376254Z D: int, 2025-05-07T20:33:07.5376514Z scale_ub: Optional[float], 2025-05-07T20:33:07.5376825Z contiguous: bool, 2025-05-07T20:33:07.5377109Z compiled: bool, 2025-05-07T20:33:07.5377372Z ) -> None: 2025-05-07T20:33:07.5377622Z torch.manual_seed(2025) 2025-05-07T20:33:07.5377910Z 2025-05-07T20:33:07.5378230Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5378630Z 2025-05-07T20:33:07.5378856Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5379197Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5379556Z x = x_sign * x_clamp 2025-05-07T20:33:07.5379839Z x0 = x[:, :D] 2025-05-07T20:33:07.5380120Z x1 = x[:, D:] 2025-05-07T20:33:07.5380387Z 2025-05-07T20:33:07.5380600Z if contiguous: 2025-05-07T20:33:07.5380874Z x0 = x0.contiguous() 2025-05-07T20:33:07.5381174Z x1 = x1.contiguous() 2025-05-07T20:33:07.5381449Z 2025-05-07T20:33:07.5381675Z if scale_ub is not None: 2025-05-07T20:33:07.5381993Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5382375Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5382737Z ) 2025-05-07T20:33:07.5382966Z else: 2025-05-07T20:33:07.5383208Z scale_ub_tensor = None 2025-05-07T20:33:07.5383501Z 2025-05-07T20:33:07.5383774Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5384138Z op = silu_mul_quant 2025-05-07T20:33:07.5384424Z if compiled: 
2025-05-07T20:33:07.5384711Z op = torch.compile(op) 2025-05-07T20:33:07.5385058Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5385374Z 2025-05-07T20:33:07.5385598Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.5385931Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.5386263Z 2025-05-07T20:33:07.5386543Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5386932Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.5387263Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.5387633Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.5388053Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5388416Z 2025-05-07T20:33:07.5388703Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:07.5388936Z 2025-05-07T20:33:07.5389054Z moe/activation_test.py:126: 2025-05-07T20:33:07.5389403Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5389792Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.5390224Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5391131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.5391997Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.5392623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5393407Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5394322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.5395152Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5396022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:07.5396976Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5397816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.5398546Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.5399240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.5399843Z fn() 2025-05-07T20:33:07.5400461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.5401119Z self.fn.run( 2025-05-07T20:33:07.5401659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5402261Z kernel = self.compile( 2025-05-07T20:33:07.5402873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5403626Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5404077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5404340Z 2025-05-07T20:33:07.5404583Z self = 2025-05-07T20:33:07.5405799Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:33:07.5407365Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdb65256c0>} 2025-05-07T20:33:07.5408895Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5410073Z context = 2025-05-07T20:33:07.5410445Z 2025-05-07T20:33:07.5410635Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5411232Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5411770Z module_map=module_map) 2025-05-07T20:33:07.5412187Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5412595Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.5412905Z E ^ 2025-05-07T20:33:07.5413493Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5414004Z 2025-05-07T20:33:07.5414480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5415064Z 2025-05-07T20:33:07.5415186Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5415659Z self=, 2025-05-07T20:33:07.5416119Z T=16384, 2025-05-07T20:33:07.5416339Z D=7168, 2025-05-07T20:33:07.5416566Z scale_ub=1200.0, 2025-05-07T20:33:07.5416829Z contiguous=False, 2025-05-07T20:33:07.5417082Z compiled=False, 2025-05-07T20:33:07.5417324Z ) 2025-05-07T20:33:08.8505334Z self = 2025-05-07T20:33:08.8506312Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:08.8506739Z 2025-05-07T20:33:08.8506866Z @given( 2025-05-07T20:33:08.8507207Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.8507607Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.8507956Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.8508420Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.8508852Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.8509180Z ) 2025-05-07T20:33:08.8509574Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.8510064Z def test_silu_mul_quant( 2025-05-07T20:33:08.8510358Z self, 2025-05-07T20:33:08.8510606Z T: int, 2025-05-07T20:33:08.8510820Z D: int, 2025-05-07T20:33:08.8511065Z scale_ub: Optional[float], 2025-05-07T20:33:08.8511370Z contiguous: bool, 2025-05-07T20:33:08.8511634Z compiled: bool, 2025-05-07T20:33:08.8511886Z ) -> None: 2025-05-07T20:33:08.8512128Z torch.manual_seed(2025) 2025-05-07T20:33:08.8512396Z 2025-05-07T20:33:08.8512705Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.8513087Z 2025-05-07T20:33:08.8513296Z x_sign = torch.sign(x) 2025-05-07T20:33:08.8513695Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.8514043Z x = x_sign * x_clamp 2025-05-07T20:33:08.8514314Z x0 = x[:, :D] 2025-05-07T20:33:08.8514551Z x1 = x[:, D:] 2025-05-07T20:33:08.8514785Z 2025-05-07T20:33:08.8514997Z if contiguous: 2025-05-07T20:33:08.8515251Z x0 = x0.contiguous() 2025-05-07T20:33:08.8515538Z x1 = x1.contiguous() 2025-05-07T20:33:08.8515818Z 2025-05-07T20:33:08.8516027Z if scale_ub is not None: 2025-05-07T20:33:08.8516349Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.8516728Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.8517068Z ) 2025-05-07T20:33:08.8517285Z else: 2025-05-07T20:33:08.8517528Z scale_ub_tensor = None 2025-05-07T20:33:08.8517818Z 2025-05-07T20:33:08.8518074Z def fn() -> Tuple[torch.Tensor, 
torch.Tensor]: 2025-05-07T20:33:08.8518426Z op = silu_mul_quant 2025-05-07T20:33:08.8518714Z if compiled: 2025-05-07T20:33:08.8518991Z op = torch.compile(op) 2025-05-07T20:33:08.8519324Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.8519633Z 2025-05-07T20:33:08.8519845Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.8520036Z 2025-05-07T20:33:08.8520149Z moe/activation_test.py:117: 2025-05-07T20:33:08.8520524Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.8520897Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.8521214Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.8521989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.8522848Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.8523441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.8524613Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.8525354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.8525948Z kernel = self.compile( 2025-05-07T20:33:08.8526543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.8527271Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.8527713Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.8528057Z 2025-05-07T20:33:08.8528286Z self = 2025-05-07T20:33:08.8529491Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.8531142Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdb65248b0>} 2025-05-07T20:33:08.8532632Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.8533763Z context = 2025-05-07T20:33:08.8534088Z 2025-05-07T20:33:08.8534272Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.8534849Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.8535368Z module_map=module_map) 2025-05-07T20:33:08.8535780Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.8536163Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.8536455Z E ^ 2025-05-07T20:33:08.8536974Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.8537467Z 2025-05-07T20:33:08.8537924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.8538492Z 2025-05-07T20:33:08.8538608Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.8539071Z self=, 2025-05-07T20:33:08.8539516Z T=1, 2025-05-07T20:33:08.8539717Z D=7168, 2025-05-07T20:33:08.8539934Z scale_ub=None, 2025-05-07T20:33:08.8540176Z contiguous=True, 2025-05-07T20:33:08.8540421Z compiled=True, 2025-05-07T20:33:08.8540650Z ) 2025-05-07T20:33:08.8541003Z self = 2025-05-07T20:33:08.8541532Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:08.8541826Z 2025-05-07T20:33:08.8541918Z @given( 2025-05-07T20:33:08.8542180Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.8542520Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.8542863Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.8543231Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.8543598Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.8543910Z ) 2025-05-07T20:33:08.8544305Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.8544799Z def test_silu_mul_quant( 2025-05-07T20:33:08.8545063Z self, 2025-05-07T20:33:08.8545282Z T: int, 2025-05-07T20:33:08.8545579Z D: int, 2025-05-07T20:33:08.8545819Z scale_ub: Optional[float], 2025-05-07T20:33:08.8546126Z contiguous: bool, 2025-05-07T20:33:08.8546397Z compiled: bool, 2025-05-07T20:33:08.8546647Z ) -> None: 2025-05-07T20:33:08.8546888Z torch.manual_seed(2025) 2025-05-07T20:33:08.8547159Z 2025-05-07T20:33:08.8547459Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.8547842Z 2025-05-07T20:33:08.8548062Z x_sign = torch.sign(x) 2025-05-07T20:33:08.8548380Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.8548725Z x = x_sign * x_clamp 2025-05-07T20:33:08.8548992Z x0 = x[:, :D] 2025-05-07T20:33:08.8549235Z x1 = x[:, D:] 2025-05-07T20:33:08.8549512Z 2025-05-07T20:33:08.8549724Z if contiguous: 2025-05-07T20:33:08.8549982Z x0 = x0.contiguous() 2025-05-07T20:33:08.8550277Z x1 = x1.contiguous() 2025-05-07T20:33:08.8550569Z 2025-05-07T20:33:08.8550827Z if scale_ub is not None: 2025-05-07T20:33:08.8551136Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.8551512Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.8551908Z ) 2025-05-07T20:33:08.8552159Z else: 2025-05-07T20:33:08.8552404Z scale_ub_tensor = None 2025-05-07T20:33:08.8552689Z 2025-05-07T20:33:08.8552941Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.8553293Z op = silu_mul_quant 2025-05-07T20:33:08.8553651Z if compiled: 2025-05-07T20:33:08.8553925Z op = torch.compile(op) 2025-05-07T20:33:08.8554255Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.8554565Z 2025-05-07T20:33:08.8554784Z y_fp8, y_scale = fn() 2025-05-07T20:33:08.8555101Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:08.8555427Z 2025-05-07T20:33:08.8555696Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.8556062Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:08.8556388Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:08.8556739Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:08.8557137Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.8557487Z 2025-05-07T20:33:08.8557713Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:08.8557929Z 2025-05-07T20:33:08.8558043Z moe/activation_test.py:126: 2025-05-07T20:33:08.8558376Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.8558755Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:08.8559130Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:08.8560008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:08.8560853Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:08.8561466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.8562230Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.8562990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:08.8563791Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.8564627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:08.8565452Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:08.8566266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:08.8567028Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:08.8567694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:08.8568263Z fn() 2025-05-07T20:33:08.8568835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:08.8569474Z self.fn.run( 2025-05-07T20:33:08.8569987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.8570625Z kernel = self.compile( 2025-05-07T20:33:08.8571219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.8571939Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.8572423Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.8572684Z 2025-05-07T20:33:08.8572918Z self = 2025-05-07T20:33:08.8574149Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.8575737Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efdb62c4e50>} 2025-05-07T20:33:08.8577222Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.8578348Z context = 2025-05-07T20:33:08.8578671Z 2025-05-07T20:33:08.8578856Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.8579436Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.8579953Z module_map=module_map) 2025-05-07T20:33:08.8580363Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.8580764Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:08.8581060Z E ^ 2025-05-07T20:33:08.8581569Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.8582070Z 2025-05-07T20:33:08.8582528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.8583092Z 2025-05-07T20:33:08.8583218Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.8583679Z self=, 2025-05-07T20:33:08.8584119Z T=4096, 2025-05-07T20:33:08.8584331Z D=5120, 2025-05-07T20:33:08.8584550Z scale_ub=None, 2025-05-07T20:33:08.8584788Z contiguous=False, 2025-05-07T20:33:08.8585042Z compiled=False, 2025-05-07T20:33:08.8585273Z ) 2025-05-07T20:33:10.4835628Z self = 2025-05-07T20:33:10.4836386Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:10.4836811Z 2025-05-07T20:33:10.4836941Z @given( 2025-05-07T20:33:10.4837307Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.4837766Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.4838208Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.4838673Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.4839030Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.4839360Z ) 2025-05-07T20:33:10.4839746Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.4840363Z def test_silu_mul_quant( 2025-05-07T20:33:10.4840641Z self, 2025-05-07T20:33:10.4840868Z T: int, 2025-05-07T20:33:10.4841091Z D: int, 2025-05-07T20:33:10.4841328Z scale_ub: Optional[float], 2025-05-07T20:33:10.4841628Z contiguous: bool, 2025-05-07T20:33:10.4841898Z compiled: bool, 2025-05-07T20:33:10.4842142Z ) -> None: 2025-05-07T20:33:10.4842388Z torch.manual_seed(2025) 2025-05-07T20:33:10.4842667Z 2025-05-07T20:33:10.4842963Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.4843335Z 2025-05-07T20:33:10.4843551Z x_sign = torch.sign(x) 2025-05-07T20:33:10.4843865Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.4844201Z x = x_sign * x_clamp 2025-05-07T20:33:10.4844530Z x0 = x[:, :D] 2025-05-07T20:33:10.4844760Z x1 = x[:, D:] 2025-05-07T20:33:10.4844991Z 2025-05-07T20:33:10.4845194Z if contiguous: 2025-05-07T20:33:10.4845443Z x0 = x0.contiguous() 2025-05-07T20:33:10.4845726Z x1 = x1.contiguous() 2025-05-07T20:33:10.4845987Z 2025-05-07T20:33:10.4846191Z if scale_ub is not None: 2025-05-07T20:33:10.4846488Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.4846973Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.4847308Z ) 2025-05-07T20:33:10.4847513Z else: 2025-05-07T20:33:10.4847741Z scale_ub_tensor = None 2025-05-07T20:33:10.4848012Z 2025-05-07T20:33:10.4848260Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.4848595Z op = silu_mul_quant 2025-05-07T20:33:10.4848862Z if compiled: 
2025-05-07T20:33:10.4849145Z op = torch.compile(op) 2025-05-07T20:33:10.4849473Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.4849763Z 2025-05-07T20:33:10.4849973Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.4850150Z 2025-05-07T20:33:10.4850266Z moe/activation_test.py:117: 2025-05-07T20:33:10.4850585Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.4850936Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.4857520Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.4858268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.4858991Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.4859566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.4860288Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.4861036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.4861594Z kernel = self.compile( 2025-05-07T20:33:10.4862175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.4862866Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.4863284Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.4863532Z 2025-05-07T20:33:10.4863754Z self = 2025-05-07T20:33:10.4864880Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.4866322Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdb62c5630>} 2025-05-07T20:33:10.4867804Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.4868870Z context = 2025-05-07T20:33:10.4869180Z 2025-05-07T20:33:10.4869359Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.4869906Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.4870399Z module_map=module_map) 2025-05-07T20:33:10.4870843Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.4871227Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.4871502Z E ^ 2025-05-07T20:33:10.4872064Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:10.4873201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis then raised the same CompilationError (ValueError: type fp8e4nv not supported in this architecture; the supported fp8 dtypes are 'fp8e4b15' and 'fp8e5'), with an identical traceback and test listing, for each of the following examples:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)   -> CompilationError in _fbgemm_silu_mul_quant (silu_mul_quant)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)     -> CompilationError in _kernel_quantize_fp8_row (ref_fn)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)    -> CompilationError in _fbgemm_silu_mul_quant (silu_mul_quant)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)  -> CompilationError in _fbgemm_silu_mul_quant (silu_mul_quant)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)        -> CompilationError in _kernel_quantize_fp8_row (ref_fn)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)     -> CompilationError in _kernel_quantize_fp8_row (ref_fn)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)      -> CompilationError in _kernel_quantize_fp8_row (ref_fn)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)     -> CompilationError in _kernel_quantize_fp8_row (ref_fn)
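Every failure above has the same root cause: Triton lowers the FP8 E4M3 dtype ("fp8e4nv") only on NVIDIA GPUs with compute capability 8.9 or newer, while older architectures expose just fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal sketch of a capability guard that would skip these cases on unsupported hardware (the helper name and the skip placement are assumptions for illustration, not part of the FBGEMM test suite):

    import unittest
    import torch

    def gpu_supports_fp8_e4m3() -> bool:
        # fp8e4nv (FP8 E4M3) requires SM 8.9+ (Ada- or Hopper-class GPUs);
        # earlier GPUs only support fp8e4b15 / fp8e5, per the error above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the failing test:
    # @unittest.skipUnless(gpu_supports_fp8_e4m3(), "FP8 E4M3 unsupported on this GPU")
    # def test_silu_mul_quant(self, ...): ...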
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.8634730Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd90fa09d0>} 2025-05-07T20:33:13.8636258Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.8637425Z context = 2025-05-07T20:33:13.8637758Z 2025-05-07T20:33:13.8637952Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.8638649Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.8639195Z module_map=module_map) 2025-05-07T20:33:13.8639617Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.8640035Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:13.8640351Z E ^ 2025-05-07T20:33:13.8640891Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.8641438Z 2025-05-07T20:33:13.8642031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.8642762Z 2025-05-07T20:33:13.8642915Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.8643474Z self=, 2025-05-07T20:33:13.8644000Z T=16384, 2025-05-07T20:33:13.8644232Z D=5120, 2025-05-07T20:33:13.8644462Z scale_ub=None, 2025-05-07T20:33:13.8644718Z contiguous=True, 2025-05-07T20:33:13.8644982Z compiled=True, 2025-05-07T20:33:13.8645226Z ) 2025-05-07T20:33:13.9066831Z W0507 20:33:13.905000 88023 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:33:13.9068363Z W0507 20:33:13.905000 88023 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:33:13.9069825Z W0507 20:33:13.905000 88023 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:33:13.9070910Z W0507 20:33:13.905000 88023 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:33:13.9072124Z W0507 20:33:13.905000 88023 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 
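The repeated CompilationError above comes from Triton rejecting the fp8e4nv (FP8 E4M3) element type during ast_to_ttir: Triton only lowers fp8e4nv conversions on GPUs with compute capability 8.9 or newer (Ada/Hopper), and this job appears to be running on an older SM 8.6-class device (such as an A10G), where only fp8e4b15 and fp8e5 are exposed — exactly what the ValueError reports. The stride-mismatch recompiles in the warning just above are a separate, benign symptom: the Hypothesis sweep over T, D, and contiguous changes tensor strides between examples, so torch.compile keeps re-guarding silu_mul_quant until it hits config.recompile_limit (8). Below is a minimal sketch of how such tests could be gated on hardware support; supports_fp8e4nv is a hypothetical helper written for illustration, not part of the FBGEMM test suite.

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        """Best-effort check for FP8 E4M3 (fp8e4nv) support in Triton kernels.

        Triton only supports fp8e4nv on compute capability >= 8.9
        (Ada/Hopper); older GPUs expose only fp8e4b15/fp8e5, which is what
        the ValueError in this log is reporting.
        """
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

    # Hypothetical usage on a test like test_silu_mul_quant:
    #
    #   @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv needs SM 8.9+")
    #   def test_silu_mul_quant(self, ...): ...

For the recompile-limit warning, running with TORCH_LOGS="recompiles" (as the warning itself suggests) lists every guard failure; alternatively, marking the token dimension dynamic before compiling, e.g. torch._dynamo.mark_dynamic(x0, 0), avoids one recompile per distinct T/stride combination in the sweep.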
2025-05-07T20:33:14.0188827Z self = 2025-05-07T20:33:14.0189881Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:14.0190429Z 2025-05-07T20:33:14.0190589Z @given( 2025-05-07T20:33:14.0191057Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.0191680Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.0192095Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.0192476Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.0192851Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.0193179Z ) 2025-05-07T20:33:14.0193632Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.0194140Z def test_silu_mul_quant( 2025-05-07T20:33:14.0194424Z self, 2025-05-07T20:33:14.0194651Z T: int, 2025-05-07T20:33:14.0194881Z D: int, 2025-05-07T20:33:14.0195133Z scale_ub: Optional[float], 2025-05-07T20:33:14.0195441Z contiguous: bool, 2025-05-07T20:33:14.0195720Z compiled: bool, 2025-05-07T20:33:14.0195983Z ) -> None: 2025-05-07T20:33:14.0196231Z torch.manual_seed(2025) 2025-05-07T20:33:14.0196511Z 2025-05-07T20:33:14.0196832Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.0197220Z 2025-05-07T20:33:14.0197444Z x_sign = torch.sign(x) 2025-05-07T20:33:14.0197782Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.0198133Z x = x_sign * x_clamp 2025-05-07T20:33:14.0198416Z x0 = x[:, :D] 2025-05-07T20:33:14.0198670Z x1 = x[:, D:] 2025-05-07T20:33:14.0198916Z 2025-05-07T20:33:14.0199126Z if contiguous: 2025-05-07T20:33:14.0199398Z x0 = x0.contiguous() 2025-05-07T20:33:14.0199704Z x1 = x1.contiguous() 2025-05-07T20:33:14.0200077Z 2025-05-07T20:33:14.0200310Z if scale_ub is not None: 2025-05-07T20:33:14.0200631Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.0201010Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.0201371Z ) 2025-05-07T20:33:14.0201610Z else: 2025-05-07T20:33:14.0201853Z scale_ub_tensor = None 2025-05-07T20:33:14.0202143Z 2025-05-07T20:33:14.0202414Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.0202771Z op = silu_mul_quant 2025-05-07T20:33:14.0203061Z if compiled: 2025-05-07T20:33:14.0203353Z op = torch.compile(op) 2025-05-07T20:33:14.0203689Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.0204075Z 2025-05-07T20:33:14.0204299Z y_fp8, y_scale = fn() 2025-05-07T20:33:14.0204626Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:14.0204956Z 2025-05-07T20:33:14.0205232Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.0205614Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:14.0205945Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:14.0206376Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:14.0206866Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:14.0207220Z 2025-05-07T20:33:14.0207457Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:14.0207681Z 2025-05-07T20:33:14.0207803Z moe/activation_test.py:126: 2025-05-07T20:33:14.0208138Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.0208522Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:14.0208898Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:14.0209800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:14.0210647Z 
_kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:14.0211269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.0212048Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.0212830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:14.0213646Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:14.0214502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:14.0215350Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:14.0216190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:14.0216919Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:14.0217606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:14.0218198Z fn() 2025-05-07T20:33:14.0218772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:14.0219434Z self.fn.run( 2025-05-07T20:33:14.0219969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.0220573Z kernel = self.compile( 2025-05-07T20:33:14.0221180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.0221926Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.0222381Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.0222642Z 2025-05-07T20:33:14.0222932Z self = 2025-05-07T20:33:14.0224330Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.0225894Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd90fa2170>} 2025-05-07T20:33:14.0227416Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.0228650Z context = 2025-05-07T20:33:14.0228978Z 2025-05-07T20:33:14.0229176Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.0229776Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.0230316Z module_map=module_map) 2025-05-07T20:33:14.0230803Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.0231271Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:14.0231585Z E ^ 2025-05-07T20:33:14.0232117Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.0232622Z 2025-05-07T20:33:14.0233090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.0233739Z 2025-05-07T20:33:14.0233864Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.0234334Z self=, 2025-05-07T20:33:14.0234792Z T=1, 2025-05-07T20:33:14.0235004Z D=5120, 2025-05-07T20:33:14.0235234Z scale_ub=1200.0, 2025-05-07T20:33:14.0235492Z contiguous=True, 2025-05-07T20:33:14.0235744Z compiled=True, 2025-05-07T20:33:14.0235980Z ) 2025-05-07T20:33:14.1823307Z self = 2025-05-07T20:33:14.1824619Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:14.1825145Z 2025-05-07T20:33:14.1825310Z @given( 2025-05-07T20:33:14.1825782Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.1826412Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.1827019Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.1827684Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.1828346Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.1828924Z ) 2025-05-07T20:33:14.1829620Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.1830508Z def test_silu_mul_quant( 2025-05-07T20:33:14.1830998Z self, 2025-05-07T20:33:14.1831388Z T: int, 2025-05-07T20:33:14.1831791Z D: int, 2025-05-07T20:33:14.1832125Z scale_ub: Optional[float], 2025-05-07T20:33:14.1832449Z contiguous: bool, 2025-05-07T20:33:14.1832732Z compiled: bool, 2025-05-07T20:33:14.1832993Z ) -> None: 2025-05-07T20:33:14.1833242Z torch.manual_seed(2025) 2025-05-07T20:33:14.1833584Z 2025-05-07T20:33:14.1833903Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.1834290Z 2025-05-07T20:33:14.1834521Z x_sign = torch.sign(x) 2025-05-07T20:33:14.1834862Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.1835222Z x = x_sign * x_clamp 2025-05-07T20:33:14.1835501Z x0 = x[:, :D] 2025-05-07T20:33:14.1835759Z x1 = x[:, D:] 2025-05-07T20:33:14.1836004Z 2025-05-07T20:33:14.1836219Z if contiguous: 2025-05-07T20:33:14.1836658Z x0 = x0.contiguous() 2025-05-07T20:33:14.1836964Z x1 = x1.contiguous() 2025-05-07T20:33:14.1837238Z 2025-05-07T20:33:14.1837462Z if scale_ub is not None: 2025-05-07T20:33:14.1837781Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.1838162Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.1838515Z ) 2025-05-07T20:33:14.1838735Z else: 2025-05-07T20:33:14.1838988Z scale_ub_tensor = None 2025-05-07T20:33:14.1839281Z 2025-05-07T20:33:14.1839543Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.1839905Z op = silu_mul_quant 2025-05-07T20:33:14.1840198Z if compiled: 2025-05-07T20:33:14.1840487Z op = torch.compile(op) 2025-05-07T20:33:14.1840901Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.1841217Z 2025-05-07T20:33:14.1841445Z > y_fp8, y_scale = fn() 2025-05-07T20:33:14.1841633Z 2025-05-07T20:33:14.1841754Z moe/activation_test.py:117: 2025-05-07T20:33:14.1842095Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.1842474Z moe/activation_test.py:115: in fn 2025-05-07T20:33:14.1842868Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.1843561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:14.1844197Z return fn(*args, **kwargs) 
2025-05-07T20:33:14.1844949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:14.1845720Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:14.1846327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.1847101Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.1847843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.1848443Z kernel = self.compile( 2025-05-07T20:33:14.1849057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.1849803Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.1850248Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.1850510Z 2025-05-07T20:33:14.1850744Z self = 2025-05-07T20:33:14.1851984Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.1853566Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd90f9a050>} 2025-05-07T20:33:14.1855069Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.1856219Z context = 2025-05-07T20:33:14.1856550Z 2025-05-07T20:33:14.1856741Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.1857337Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.1857860Z module_map=module_map) 2025-05-07T20:33:14.1858275Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.1858683Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:14.1858978Z E ^ 2025-05-07T20:33:14.1859555Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.1860066Z 2025-05-07T20:33:14.1860534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.1861108Z 2025-05-07T20:33:14.1861235Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.1861705Z self=, 2025-05-07T20:33:14.1862169Z T=1, 2025-05-07T20:33:14.1862415Z D=5120, 2025-05-07T20:33:14.1862637Z scale_ub=None, 2025-05-07T20:33:14.1862882Z contiguous=False, 2025-05-07T20:33:14.1863143Z compiled=True, 2025-05-07T20:33:14.1863376Z ) 2025-05-07T20:33:14.2603909Z self = 2025-05-07T20:33:14.2604605Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:14.2604906Z 2025-05-07T20:33:14.2604997Z @given( 2025-05-07T20:33:14.2605266Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.2605613Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.2605960Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.2606411Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.2606846Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.2607169Z ) 2025-05-07T20:33:14.2607568Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.2608065Z def test_silu_mul_quant( 2025-05-07T20:33:14.2608335Z self, 2025-05-07T20:33:14.2608557Z T: int, 2025-05-07T20:33:14.2608785Z D: int, 2025-05-07T20:33:14.2609025Z scale_ub: Optional[float], 2025-05-07T20:33:14.2609333Z contiguous: bool, 2025-05-07T20:33:14.2609611Z compiled: bool, 2025-05-07T20:33:14.2609859Z ) -> None: 2025-05-07T20:33:14.2610106Z torch.manual_seed(2025) 2025-05-07T20:33:14.2610378Z 2025-05-07T20:33:14.2610686Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.2611066Z 2025-05-07T20:33:14.2611286Z x_sign = torch.sign(x) 2025-05-07T20:33:14.2611613Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.2611969Z x = x_sign * x_clamp 2025-05-07T20:33:14.2612242Z x0 = x[:, :D] 2025-05-07T20:33:14.2612486Z x1 = x[:, D:] 2025-05-07T20:33:14.2612719Z 2025-05-07T20:33:14.2612932Z if contiguous: 2025-05-07T20:33:14.2613196Z x0 = x0.contiguous() 2025-05-07T20:33:14.2613485Z x1 = x1.contiguous() 2025-05-07T20:33:14.2613762Z 2025-05-07T20:33:14.2613979Z if scale_ub is not None: 2025-05-07T20:33:14.2614287Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.2614668Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.2615019Z ) 2025-05-07T20:33:14.2615237Z else: 2025-05-07T20:33:14.2615479Z scale_ub_tensor = None 2025-05-07T20:33:14.2615762Z 2025-05-07T20:33:14.2616019Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.2616369Z op = silu_mul_quant 2025-05-07T20:33:14.2616656Z if compiled: 2025-05-07T20:33:14.2616935Z op = torch.compile(op) 2025-05-07T20:33:14.2617271Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.2617581Z 2025-05-07T20:33:14.2617795Z y_fp8, y_scale = fn() 2025-05-07T20:33:14.2618117Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:14.2618442Z 2025-05-07T20:33:14.2618711Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.2619083Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:14.2619418Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:14.2619770Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:14.2620244Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:14.2620600Z 2025-05-07T20:33:14.2620833Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:14.2621051Z 2025-05-07T20:33:14.2621163Z moe/activation_test.py:126: 2025-05-07T20:33:14.2621508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.2621886Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:14.2622297Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:14.2623183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:14.2624185Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:14.2624800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.2625633Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.2626406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:14.2627216Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:14.2628188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:14.2629024Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:14.2629838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:14.2630551Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:14.2631231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:14.2631819Z fn() 2025-05-07T20:33:14.2632433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:14.2633086Z self.fn.run( 2025-05-07T20:33:14.2633709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.2634311Z kernel = self.compile( 2025-05-07T20:33:14.2634917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.2635650Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.2636089Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.2636350Z 2025-05-07T20:33:14.2636582Z self = 2025-05-07T20:33:14.2637788Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.2639333Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efd91499900>} 2025-05-07T20:33:14.2640835Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.2641984Z context = 2025-05-07T20:33:14.2642313Z 2025-05-07T20:33:14.2642499Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.2643085Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.2643609Z module_map=module_map) 2025-05-07T20:33:14.2644018Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.2644496Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:14.2644801Z E ^ 2025-05-07T20:33:14.2645319Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.2645830Z 2025-05-07T20:33:14.2646300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.2646876Z 2025-05-07T20:33:14.2647007Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.2647476Z self=, 2025-05-07T20:33:14.2647922Z T=1, 2025-05-07T20:33:14.2648131Z D=5120, 2025-05-07T20:33:14.2648351Z scale_ub=None, 2025-05-07T20:33:14.2648590Z contiguous=True, 2025-05-07T20:33:14.2648844Z compiled=False, 2025-05-07T20:33:14.2649132Z ) 2025-05-07T20:33:14.6098394Z self = 2025-05-07T20:33:14.6099463Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:14.6099977Z 2025-05-07T20:33:14.6100136Z @given( 2025-05-07T20:33:14.6100594Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.6101193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.6102145Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.6102620Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.6103000Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.6103327Z ) 2025-05-07T20:33:14.6103732Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.6104239Z def test_silu_mul_quant( 2025-05-07T20:33:14.6104520Z self, 2025-05-07T20:33:14.6104752Z T: int, 2025-05-07T20:33:14.6104990Z D: int, 2025-05-07T20:33:14.6105241Z scale_ub: Optional[float], 2025-05-07T20:33:14.6105554Z contiguous: bool, 2025-05-07T20:33:14.6105834Z compiled: bool, 2025-05-07T20:33:14.6106095Z ) -> None: 2025-05-07T20:33:14.6106349Z torch.manual_seed(2025) 2025-05-07T20:33:14.6106631Z 2025-05-07T20:33:14.6106943Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.6107341Z 2025-05-07T20:33:14.6107569Z x_sign = torch.sign(x) 2025-05-07T20:33:14.6107908Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.6108273Z x = x_sign * x_clamp 2025-05-07T20:33:14.6108552Z x0 = x[:, :D] 2025-05-07T20:33:14.6108808Z x1 = x[:, D:] 2025-05-07T20:33:14.6109045Z 2025-05-07T20:33:14.6109265Z if contiguous: 2025-05-07T20:33:14.6109535Z x0 = x0.contiguous() 2025-05-07T20:33:14.6109832Z x1 = x1.contiguous() 2025-05-07T20:33:14.6110114Z 2025-05-07T20:33:14.6110345Z if scale_ub is not None: 2025-05-07T20:33:14.6110667Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.6111051Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.6111405Z ) 2025-05-07T20:33:14.6111633Z else: 2025-05-07T20:33:14.6111882Z scale_ub_tensor = None 2025-05-07T20:33:14.6112165Z 2025-05-07T20:33:14.6112440Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.6119492Z op = silu_mul_quant 2025-05-07T20:33:14.6119813Z if compiled: 2025-05-07T20:33:14.6120102Z 
op = torch.compile(op) 2025-05-07T20:33:14.6120450Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.6120773Z 2025-05-07T20:33:14.6120995Z > y_fp8, y_scale = fn() 2025-05-07T20:33:14.6121193Z 2025-05-07T20:33:14.6121310Z moe/activation_test.py:117: 2025-05-07T20:33:14.6121653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.6122045Z moe/activation_test.py:115: in fn 2025-05-07T20:33:14.6122366Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.6123279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:14.6124267Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:14.6124880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.6125668Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.6126430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.6127048Z kernel = self.compile( 2025-05-07T20:33:14.6127667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.6128420Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.6128967Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.6129238Z 2025-05-07T20:33:14.6129479Z self = 2025-05-07T20:33:14.6130782Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.6132428Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd914988b0>} 2025-05-07T20:33:14.6133964Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.6135136Z context = 2025-05-07T20:33:14.6135465Z 2025-05-07T20:33:14.6135657Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.6136260Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.6136798Z module_map=module_map) 2025-05-07T20:33:14.6137221Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.6137628Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:14.6137928Z E ^ 2025-05-07T20:33:14.6138463Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.6138976Z 2025-05-07T20:33:14.6139450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.6140038Z 2025-05-07T20:33:14.6140159Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.6140633Z self=, 2025-05-07T20:33:14.6141101Z T=128, 2025-05-07T20:33:14.6141319Z D=5120, 2025-05-07T20:33:14.6141547Z scale_ub=None, 2025-05-07T20:33:14.6141805Z contiguous=False, 2025-05-07T20:33:14.6142064Z compiled=True, 2025-05-07T20:33:14.6142302Z ) 2025-05-07T20:33:14.6142672Z self = 2025-05-07T20:33:14.6143242Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:14.6143557Z 2025-05-07T20:33:14.6143646Z @given( 2025-05-07T20:33:14.6143911Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.6144269Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.6144621Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.6145002Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.6145385Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.6145718Z ) 2025-05-07T20:33:14.6146124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.6146706Z def test_silu_mul_quant( 2025-05-07T20:33:14.6146988Z self, 2025-05-07T20:33:14.6147214Z T: int, 2025-05-07T20:33:14.6147444Z D: int, 2025-05-07T20:33:14.6147690Z scale_ub: Optional[float], 2025-05-07T20:33:14.6148008Z contiguous: bool, 2025-05-07T20:33:14.6148292Z compiled: bool, 2025-05-07T20:33:14.6148554Z ) -> None: 2025-05-07T20:33:14.6148810Z torch.manual_seed(2025) 2025-05-07T20:33:14.6149093Z 2025-05-07T20:33:14.6149405Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.6149800Z 2025-05-07T20:33:14.6150028Z x_sign = torch.sign(x) 2025-05-07T20:33:14.6150371Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.6150727Z x = x_sign * x_clamp 2025-05-07T20:33:14.6151062Z x0 = x[:, :D] 2025-05-07T20:33:14.6151317Z x1 = x[:, D:] 2025-05-07T20:33:14.6151556Z 2025-05-07T20:33:14.6151775Z if contiguous: 2025-05-07T20:33:14.6152049Z x0 = x0.contiguous() 2025-05-07T20:33:14.6152392Z x1 = x1.contiguous() 2025-05-07T20:33:14.6152670Z 2025-05-07T20:33:14.6152897Z if scale_ub is not None: 2025-05-07T20:33:14.6153212Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.6153821Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.6154181Z ) 2025-05-07T20:33:14.6154400Z else: 2025-05-07T20:33:14.6154646Z scale_ub_tensor = None 2025-05-07T20:33:14.6154937Z 2025-05-07T20:33:14.6155203Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.6155570Z op = silu_mul_quant 2025-05-07T20:33:14.6155860Z if compiled: 2025-05-07T20:33:14.6156149Z op = torch.compile(op) 2025-05-07T20:33:14.6156494Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.6156812Z 2025-05-07T20:33:14.6157039Z > y_fp8, y_scale = fn() 2025-05-07T20:33:14.6157231Z 2025-05-07T20:33:14.6157348Z moe/activation_test.py:117: 2025-05-07T20:33:14.6157686Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.6158067Z moe/activation_test.py:115: in fn 2025-05-07T20:33:14.6158389Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.6159029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:14.6159667Z return fn(*args, **kwargs) 
2025-05-07T20:33:14.6160419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:14.6161216Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:14.6161835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.6162664Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.6163418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.6164022Z kernel = self.compile( 2025-05-07T20:33:14.6164640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.6165392Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.6165844Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.6166110Z 2025-05-07T20:33:14.6166346Z self = 2025-05-07T20:33:14.6167567Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.6169170Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd9149b880>} 2025-05-07T20:33:14.6170686Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.6171848Z context = 2025-05-07T20:33:14.6172219Z 2025-05-07T20:33:14.6172444Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.6173048Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.6173594Z module_map=module_map) 2025-05-07T20:33:14.6174014Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.6174471Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:14.6174781Z E ^ 2025-05-07T20:33:14.6175312Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.6175830Z 2025-05-07T20:33:14.6176303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.6176941Z 2025-05-07T20:33:14.6177107Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.6177589Z self=, 2025-05-07T20:33:14.6178045Z T=128, 2025-05-07T20:33:14.6178272Z D=7168, 2025-05-07T20:33:14.6178504Z scale_ub=1200.0, 2025-05-07T20:33:14.6178768Z contiguous=False, 2025-05-07T20:33:14.6179033Z compiled=False, 2025-05-07T20:33:14.6179278Z ) 2025-05-07T20:33:14.7560934Z self = 2025-05-07T20:33:14.7562003Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:14.7562322Z 2025-05-07T20:33:14.7562424Z @given( 2025-05-07T20:33:14.7562691Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.7563046Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.7563398Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.7563775Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.7564152Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.7564478Z ) 2025-05-07T20:33:14.7564871Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.7565369Z def test_silu_mul_quant( 2025-05-07T20:33:14.7565647Z self, 2025-05-07T20:33:14.7565869Z T: int, 2025-05-07T20:33:14.7566096Z D: int, 2025-05-07T20:33:14.7566347Z scale_ub: Optional[float], 2025-05-07T20:33:14.7566654Z contiguous: bool, 2025-05-07T20:33:14.7566935Z compiled: bool, 2025-05-07T20:33:14.7567202Z ) -> None: 2025-05-07T20:33:14.7567454Z torch.manual_seed(2025) 2025-05-07T20:33:14.7567727Z 2025-05-07T20:33:14.7568041Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.7568426Z 2025-05-07T20:33:14.7568649Z x_sign = torch.sign(x) 2025-05-07T20:33:14.7568978Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.7569341Z x = x_sign * x_clamp 2025-05-07T20:33:14.7569613Z x0 = x[:, :D] 2025-05-07T20:33:14.7569865Z x1 = x[:, D:] 2025-05-07T20:33:14.7570105Z 2025-05-07T20:33:14.7570316Z if contiguous: 2025-05-07T20:33:14.7570585Z x0 = x0.contiguous() 2025-05-07T20:33:14.7570884Z x1 = x1.contiguous() 2025-05-07T20:33:14.7571153Z 2025-05-07T20:33:14.7571379Z if scale_ub is not None: 2025-05-07T20:33:14.7571692Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.7572069Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.7572423Z ) 2025-05-07T20:33:14.7572650Z else: 2025-05-07T20:33:14.7573007Z scale_ub_tensor = None 2025-05-07T20:33:14.7573293Z 2025-05-07T20:33:14.7573559Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.7573917Z op = silu_mul_quant 2025-05-07T20:33:14.7574201Z if compiled: 2025-05-07T20:33:14.7574483Z op = torch.compile(op) 2025-05-07T20:33:14.7574821Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.7575131Z 2025-05-07T20:33:14.7575351Z > y_fp8, y_scale = fn() 2025-05-07T20:33:14.7575539Z 2025-05-07T20:33:14.7575657Z moe/activation_test.py:117: 2025-05-07T20:33:14.7575986Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.7576361Z moe/activation_test.py:115: in fn 2025-05-07T20:33:14.7576679Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.7577532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:14.7578309Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:14.7578913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.7579679Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.7580585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.7581183Z kernel = self.compile( 2025-05-07T20:33:14.7581793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.7582534Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.7582984Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.7583246Z 2025-05-07T20:33:14.7583480Z self = 2025-05-07T20:33:14.7584689Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.7586232Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd91499090>} 2025-05-07T20:33:14.7587732Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.7588878Z context = 2025-05-07T20:33:14.7589203Z 2025-05-07T20:33:14.7589397Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.7589981Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.7590511Z module_map=module_map) 2025-05-07T20:33:14.7590919Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.7591316Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:14.7591613Z E ^ 2025-05-07T20:33:14.7592141Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.7592698Z 2025-05-07T20:33:14.7593163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.7593807Z 2025-05-07T20:33:14.7593927Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.7594399Z self=, 2025-05-07T20:33:14.7594849Z T=128, 2025-05-07T20:33:14.7595065Z D=5120, 2025-05-07T20:33:14.7595289Z scale_ub=None, 2025-05-07T20:33:14.7595536Z contiguous=False, 2025-05-07T20:33:14.7595792Z compiled=False, 2025-05-07T20:33:14.7596105Z ) 2025-05-07T20:33:14.7596469Z self = 2025-05-07T20:33:14.7597021Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:14.7597336Z 2025-05-07T20:33:14.7597428Z @given( 2025-05-07T20:33:14.7597696Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.7598046Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.7598399Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.7598776Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.7599148Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.7599473Z ) 2025-05-07T20:33:14.7599872Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.7600418Z def test_silu_mul_quant( 2025-05-07T20:33:14.7600692Z self, 2025-05-07T20:33:14.7600919Z T: int, 2025-05-07T20:33:14.7601153Z D: int, 2025-05-07T20:33:14.7601401Z scale_ub: Optional[float], 2025-05-07T20:33:14.7601713Z contiguous: bool, 2025-05-07T20:33:14.7602012Z compiled: bool, 2025-05-07T20:33:14.7602295Z ) -> None: 2025-05-07T20:33:14.7602593Z torch.manual_seed(2025) 2025-05-07T20:33:14.7602912Z 2025-05-07T20:33:14.7603231Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.7603631Z 2025-05-07T20:33:14.7603854Z x_sign = torch.sign(x) 2025-05-07T20:33:14.7604182Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.7604534Z x = x_sign * x_clamp 2025-05-07T20:33:14.7604811Z x0 = x[:, :D] 2025-05-07T20:33:14.7605061Z x1 = x[:, D:] 2025-05-07T20:33:14.7605301Z 2025-05-07T20:33:14.7605526Z if contiguous: 2025-05-07T20:33:14.7605786Z x0 = x0.contiguous() 2025-05-07T20:33:14.7606081Z x1 = x1.contiguous() 2025-05-07T20:33:14.7606357Z 2025-05-07T20:33:14.7606582Z if scale_ub is not None: 2025-05-07T20:33:14.7606891Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.7607269Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.7607625Z ) 2025-05-07T20:33:14.7607843Z else: 2025-05-07T20:33:14.7608088Z scale_ub_tensor = None 2025-05-07T20:33:14.7608374Z 2025-05-07T20:33:14.7608636Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.7608995Z op = silu_mul_quant 2025-05-07T20:33:14.7609278Z if compiled: 2025-05-07T20:33:14.7609555Z op = torch.compile(op) 2025-05-07T20:33:14.7609889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.7610203Z 2025-05-07T20:33:14.7610421Z > y_fp8, y_scale = fn() 2025-05-07T20:33:14.7610617Z 2025-05-07T20:33:14.7610730Z moe/activation_test.py:117: 2025-05-07T20:33:14.7611075Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.7611454Z moe/activation_test.py:115: in fn 2025-05-07T20:33:14.7611780Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.7612748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:14.7613729Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:14.7614459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.7615226Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.7615972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.7616578Z kernel = self.compile( 2025-05-07T20:33:14.7617185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.7617978Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.7618425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.7618684Z 2025-05-07T20:33:14.7618924Z self = 2025-05-07T20:33:14.7620130Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.7621693Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd913a3eb0>} 2025-05-07T20:33:14.7623570Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.7624940Z context = 2025-05-07T20:33:14.7625264Z 2025-05-07T20:33:14.7625454Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.7626176Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.7626707Z module_map=module_map) 2025-05-07T20:33:14.7627123Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.7627520Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:14.7627817Z E ^ 2025-05-07T20:33:14.7628344Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.7628849Z 2025-05-07T20:33:14.7629314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.7629893Z 2025-05-07T20:33:14.7630011Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.7630483Z self=, 2025-05-07T20:33:14.7630938Z T=128, 2025-05-07T20:33:14.7631150Z D=5120, 2025-05-07T20:33:14.7631380Z scale_ub=1200.0, 2025-05-07T20:33:14.7631639Z contiguous=True, 2025-05-07T20:33:14.7631892Z compiled=False, 2025-05-07T20:33:14.7632154Z ) 2025-05-07T20:33:14.9752883Z self = 2025-05-07T20:33:14.9753486Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:14.9753868Z 2025-05-07T20:33:14.9754005Z @given( 2025-05-07T20:33:14.9754383Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.9754753Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.9755101Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.9755468Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.9755837Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.9756156Z ) 2025-05-07T20:33:14.9756545Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.9757038Z def test_silu_mul_quant( 2025-05-07T20:33:14.9757308Z self, 2025-05-07T20:33:14.9757526Z T: int, 2025-05-07T20:33:14.9757747Z D: int, 2025-05-07T20:33:14.9757990Z scale_ub: Optional[float], 2025-05-07T20:33:14.9758296Z contiguous: bool, 2025-05-07T20:33:14.9758564Z compiled: bool, 2025-05-07T20:33:14.9758816Z ) -> None: 2025-05-07T20:33:14.9759057Z torch.manual_seed(2025) 2025-05-07T20:33:14.9759329Z 2025-05-07T20:33:14.9759635Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.9760009Z 2025-05-07T20:33:14.9760229Z x_sign = torch.sign(x) 2025-05-07T20:33:14.9760559Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.9760900Z x = x_sign * x_clamp 2025-05-07T20:33:14.9761297Z x0 = x[:, :D] 2025-05-07T20:33:14.9761550Z x1 = x[:, D:] 2025-05-07T20:33:14.9761778Z 2025-05-07T20:33:14.9761992Z if contiguous: 2025-05-07T20:33:14.9762288Z x0 = x0.contiguous() 2025-05-07T20:33:14.9762590Z x1 = x1.contiguous() 2025-05-07T20:33:14.9762858Z 2025-05-07T20:33:14.9763076Z if scale_ub is not None: 2025-05-07T20:33:14.9763379Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.9763753Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.9764100Z ) 2025-05-07T20:33:14.9764318Z else: 2025-05-07T20:33:14.9764552Z scale_ub_tensor = None 2025-05-07T20:33:14.9764838Z 2025-05-07T20:33:14.9765094Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.9765531Z op = silu_mul_quant 2025-05-07T20:33:14.9765820Z if compiled: 2025-05-07T20:33:14.9766107Z op = torch.compile(op) 2025-05-07T20:33:14.9766469Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.9766775Z 2025-05-07T20:33:14.9766984Z > y_fp8, y_scale = fn() 2025-05-07T20:33:14.9767173Z 2025-05-07T20:33:14.9767284Z moe/activation_test.py:117: 2025-05-07T20:33:14.9767746Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.9768112Z moe/activation_test.py:115: in fn 2025-05-07T20:33:14.9768426Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.9769192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:14.9769951Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:14.9770549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.9771310Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.9772046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.9772634Z kernel = self.compile( 2025-05-07T20:33:14.9773237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.9773969Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.9774410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.9774663Z 2025-05-07T20:33:14.9774890Z self = 2025-05-07T20:33:14.9776083Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.9777611Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd913a2d40>} 2025-05-07T20:33:14.9779102Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.9780238Z context = 2025-05-07T20:33:14.9780563Z 2025-05-07T20:33:14.9780748Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.9781323Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.9781842Z module_map=module_map) 2025-05-07T20:33:14.9782245Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.9782695Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:14.9782985Z E ^ 2025-05-07T20:33:14.9783546Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.9784049Z 2025-05-07T20:33:14.9784509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.9785082Z 2025-05-07T20:33:14.9785203Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.9785665Z self=, 2025-05-07T20:33:14.9786112Z T=1, 2025-05-07T20:33:14.9786314Z D=7168, 2025-05-07T20:33:14.9786532Z scale_ub=1200.0, 2025-05-07T20:33:14.9786776Z contiguous=True, 2025-05-07T20:33:14.9787023Z compiled=True, 2025-05-07T20:33:14.9787244Z ) 2025-05-07T20:33:14.9787604Z self = 2025-05-07T20:33:14.9788224Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:14.9788513Z 2025-05-07T20:33:14.9788602Z @given( 2025-05-07T20:33:14.9788997Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.9795687Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.9796045Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.9796509Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.9796921Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.9797251Z ) 2025-05-07T20:33:14.9797648Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.9798140Z def test_silu_mul_quant( 2025-05-07T20:33:14.9798413Z self, 2025-05-07T20:33:14.9798635Z T: int, 2025-05-07T20:33:14.9798856Z D: int, 2025-05-07T20:33:14.9799097Z scale_ub: Optional[float], 2025-05-07T20:33:14.9799406Z contiguous: bool, 2025-05-07T20:33:14.9799679Z compiled: bool, 2025-05-07T20:33:14.9799928Z ) -> None: 2025-05-07T20:33:14.9800171Z torch.manual_seed(2025) 2025-05-07T20:33:14.9800445Z 2025-05-07T20:33:14.9800754Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.9801136Z 2025-05-07T20:33:14.9801355Z x_sign = torch.sign(x) 2025-05-07T20:33:14.9801675Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.9802030Z x = x_sign * x_clamp 2025-05-07T20:33:14.9802302Z x0 = x[:, :D] 2025-05-07T20:33:14.9802540Z x1 = x[:, D:] 2025-05-07T20:33:14.9802774Z 2025-05-07T20:33:14.9802983Z if contiguous: 2025-05-07T20:33:14.9803244Z x0 = x0.contiguous() 2025-05-07T20:33:14.9803531Z x1 = x1.contiguous() 2025-05-07T20:33:14.9803803Z 2025-05-07T20:33:14.9804017Z if scale_ub is not None: 2025-05-07T20:33:14.9804320Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.9804699Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.9805051Z ) 2025-05-07T20:33:14.9805263Z else: 2025-05-07T20:33:14.9805500Z scale_ub_tensor = None 2025-05-07T20:33:14.9805781Z 2025-05-07T20:33:14.9806037Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.9806388Z op = silu_mul_quant 2025-05-07T20:33:14.9806673Z if compiled: 2025-05-07T20:33:14.9806952Z op = torch.compile(op) 2025-05-07T20:33:14.9807292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.9807599Z 2025-05-07T20:33:14.9807811Z > y_fp8, y_scale = fn() 2025-05-07T20:33:14.9808002Z 2025-05-07T20:33:14.9808113Z moe/activation_test.py:117: 2025-05-07T20:33:14.9808442Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.9808809Z moe/activation_test.py:115: in fn 2025-05-07T20:33:14.9809121Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.9809754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:14.9810381Z return fn(*args, **kwargs) 
2025-05-07T20:33:14.9811165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:14.9811927Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:14.9812529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.9813288Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.9814020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.9814608Z kernel = self.compile( 2025-05-07T20:33:14.9815210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.9816005Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.9816446Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.9816712Z 2025-05-07T20:33:14.9816941Z self = 2025-05-07T20:33:14.9818178Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.9819744Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd913a1480>} 2025-05-07T20:33:14.9821234Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.9822376Z context = 2025-05-07T20:33:14.9822701Z 2025-05-07T20:33:14.9822890Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.9823470Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.9824305Z module_map=module_map) 2025-05-07T20:33:14.9824720Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.9825112Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:14.9825397Z E ^ 2025-05-07T20:33:14.9825912Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.9826412Z 2025-05-07T20:33:14.9826873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.9827441Z 2025-05-07T20:33:14.9827562Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.9828022Z self=, 2025-05-07T20:33:14.9828471Z T=1, 2025-05-07T20:33:14.9828679Z D=7168, 2025-05-07T20:33:14.9828892Z scale_ub=1200.0, 2025-05-07T20:33:14.9829145Z contiguous=False, 2025-05-07T20:33:14.9829397Z compiled=True, 2025-05-07T20:33:14.9829625Z ) 2025-05-07T20:33:15.1327213Z self = 2025-05-07T20:33:15.1327877Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:15.1328305Z 2025-05-07T20:33:15.1328398Z @given( 2025-05-07T20:33:15.1328661Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.1329002Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.1329349Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.1329718Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.1330086Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.1330403Z ) 2025-05-07T20:33:15.1330915Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.1331412Z def test_silu_mul_quant( 2025-05-07T20:33:15.1331679Z self, 2025-05-07T20:33:15.1331901Z T: int, 2025-05-07T20:33:15.1332131Z D: int, 2025-05-07T20:33:15.1332413Z scale_ub: Optional[float], 2025-05-07T20:33:15.1332717Z contiguous: bool, 2025-05-07T20:33:15.1332988Z compiled: bool, 2025-05-07T20:33:15.1333234Z ) -> None: 2025-05-07T20:33:15.1333475Z torch.manual_seed(2025) 2025-05-07T20:33:15.1333761Z 2025-05-07T20:33:15.1334060Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.1334450Z 2025-05-07T20:33:15.1334669Z x_sign = torch.sign(x) 2025-05-07T20:33:15.1334988Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.1335405Z x = x_sign * x_clamp 2025-05-07T20:33:15.1335672Z x0 = x[:, :D] 2025-05-07T20:33:15.1335914Z x1 = x[:, D:] 2025-05-07T20:33:15.1336151Z 2025-05-07T20:33:15.1336367Z if contiguous: 2025-05-07T20:33:15.1336624Z x0 = x0.contiguous() 2025-05-07T20:33:15.1336915Z x1 = x1.contiguous() 2025-05-07T20:33:15.1337184Z 2025-05-07T20:33:15.1337395Z if scale_ub is not None: 2025-05-07T20:33:15.1337828Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.1338206Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.1338550Z ) 2025-05-07T20:33:15.1338763Z else: 2025-05-07T20:33:15.1339002Z scale_ub_tensor = None 2025-05-07T20:33:15.1339285Z 2025-05-07T20:33:15.1339543Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.1339894Z op = silu_mul_quant 2025-05-07T20:33:15.1340175Z if compiled: 2025-05-07T20:33:15.1340451Z op = torch.compile(op) 2025-05-07T20:33:15.1340783Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.1341089Z 2025-05-07T20:33:15.1341304Z > y_fp8, y_scale = fn() 2025-05-07T20:33:15.1341500Z 2025-05-07T20:33:15.1341613Z moe/activation_test.py:117: 2025-05-07T20:33:15.1341940Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.1342344Z moe/activation_test.py:115: in fn 2025-05-07T20:33:15.1342700Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.1343332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:15.1343957Z return fn(*args, **kwargs) 
2025-05-07T20:33:15.1344685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:15.1345451Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:15.1346046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.1346811Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.1347541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.1348131Z kernel = self.compile( 2025-05-07T20:33:15.1348739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.1349463Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.1349910Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.1350168Z 2025-05-07T20:33:15.1350401Z self = 2025-05-07T20:33:15.1351600Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.1353205Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd913a0940>} 2025-05-07T20:33:15.1354753Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.1355887Z context = 2025-05-07T20:33:15.1356204Z 2025-05-07T20:33:15.1356393Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.1356969Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.1357480Z module_map=module_map) 2025-05-07T20:33:15.1357936Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.1358326Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:15.1358610Z E ^ 2025-05-07T20:33:15.1359125Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.1359618Z 2025-05-07T20:33:15.1360077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.1360756Z 2025-05-07T20:33:15.1360880Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.1361334Z self=, 2025-05-07T20:33:15.1361775Z T=1, 2025-05-07T20:33:15.1361987Z D=7168, 2025-05-07T20:33:15.1362225Z scale_ub=None, 2025-05-07T20:33:15.1362494Z contiguous=False, 2025-05-07T20:33:15.1362786Z compiled=True, 2025-05-07T20:33:15.1363011Z ) 2025-05-07T20:33:15.4079488Z self = 2025-05-07T20:33:15.4080274Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:15.4080691Z 2025-05-07T20:33:15.4080827Z @given( 2025-05-07T20:33:15.4081170Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.4081578Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.4081931Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.4082379Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.4082813Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.4083140Z ) 2025-05-07T20:33:15.4083551Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.4084121Z def test_silu_mul_quant( 2025-05-07T20:33:15.4084405Z self, 2025-05-07T20:33:15.4084694Z T: int, 2025-05-07T20:33:15.4084927Z D: int, 2025-05-07T20:33:15.4085177Z scale_ub: Optional[float], 2025-05-07T20:33:15.4085497Z contiguous: bool, 2025-05-07T20:33:15.4085781Z compiled: bool, 2025-05-07T20:33:15.4086044Z ) -> None: 2025-05-07T20:33:15.4086300Z torch.manual_seed(2025) 2025-05-07T20:33:15.4086588Z 2025-05-07T20:33:15.4086903Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.4087294Z 2025-05-07T20:33:15.4087523Z x_sign = torch.sign(x) 2025-05-07T20:33:15.4087871Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.4088224Z x = x_sign * x_clamp 2025-05-07T20:33:15.4088503Z x0 = x[:, :D] 2025-05-07T20:33:15.4088755Z x1 = x[:, D:] 2025-05-07T20:33:15.4088996Z 2025-05-07T20:33:15.4089216Z if contiguous: 2025-05-07T20:33:15.4089489Z x0 = x0.contiguous() 2025-05-07T20:33:15.4089784Z x1 = x1.contiguous() 2025-05-07T20:33:15.4090067Z 2025-05-07T20:33:15.4090293Z if scale_ub is not None: 2025-05-07T20:33:15.4090610Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.4090996Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.4091355Z ) 2025-05-07T20:33:15.4091719Z else: 2025-05-07T20:33:15.4091969Z scale_ub_tensor = None 2025-05-07T20:33:15.4092262Z 2025-05-07T20:33:15.4092529Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.4092894Z op = silu_mul_quant 2025-05-07T20:33:15.4093183Z if compiled: 2025-05-07T20:33:15.4093473Z op = torch.compile(op) 2025-05-07T20:33:15.4093810Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.4094130Z 2025-05-07T20:33:15.4094353Z y_fp8, y_scale = fn() 2025-05-07T20:33:15.4094679Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:15.4095012Z 2025-05-07T20:33:15.4095288Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.4095666Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:15.4096078Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:15.4096439Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:15.4096846Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:15.4097207Z 2025-05-07T20:33:15.4097443Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:15.4097666Z 2025-05-07T20:33:15.4097790Z moe/activation_test.py:126: 2025-05-07T20:33:15.4098258Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.4098646Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:15.4099021Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:15.4099913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:15.4100771Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:15.4101391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.4102172Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.4102951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:15.4103772Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:15.4104632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:15.4105482Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:15.4106306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:15.4107033Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:15.4107720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:15.4108312Z fn() 2025-05-07T20:33:15.4108895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:15.4109557Z self.fn.run( 2025-05-07T20:33:15.4110094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.4110697Z kernel = self.compile( 2025-05-07T20:33:15.4111314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.4112118Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.4112675Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.4113003Z 2025-05-07T20:33:15.4113298Z self = 2025-05-07T20:33:15.4114753Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.4116332Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efd90b7eef0>} 2025-05-07T20:33:15.4117871Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.4119028Z context = 2025-05-07T20:33:15.4119358Z 2025-05-07T20:33:15.4119550Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.4120143Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.4120733Z module_map=module_map) 2025-05-07T20:33:15.4121147Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.4121558Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:15.4121866Z E ^ 2025-05-07T20:33:15.4122393Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.4122960Z 2025-05-07T20:33:15.4123476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
[Eight further Hypothesis examples — test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True), (T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False), (T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True), (T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True), (T=1, D=5120, scale_ub=None, contiguous=False, compiled=False), (T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False), (T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True), and (T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True) — reprint the same test body and fail in _fbgemm_silu_mul_quant with the identical CompilationError; the duplicate tracebacks are elided.]
Trying example: test_silu_mul_quant( self=, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False, ) [same test body and _fbgemm_silu_mul_quant traceback as above, elided] E triton.compiler.errors.CompilationError: at 1:0: E def _fbgemm_silu_mul_quant( E ^ E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.4264814Z 2025-05-07T20:33:16.4265274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.4265839Z 2025-05-07T20:33:16.4265954Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.4266412Z self=, 2025-05-07T20:33:16.4266849Z T=4096, 2025-05-07T20:33:16.4267059Z D=5120, 2025-05-07T20:33:16.4267276Z scale_ub=1200.0, 2025-05-07T20:33:16.4267526Z contiguous=False, 2025-05-07T20:33:16.4267777Z compiled=True, 2025-05-07T20:33:16.4268002Z ) 2025-05-07T20:33:16.4268352Z self = 2025-05-07T20:33:16.4268944Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:16.4269251Z 2025-05-07T20:33:16.4269339Z @given( 2025-05-07T20:33:16.4269595Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.4269943Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.4270281Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.4270648Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.4271007Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.4271324Z ) 2025-05-07T20:33:16.4271712Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.4272193Z def test_silu_mul_quant( 2025-05-07T20:33:16.4272464Z self, 2025-05-07T20:33:16.4272738Z T: int, 2025-05-07T20:33:16.4272974Z D: int, 2025-05-07T20:33:16.4273216Z scale_ub: Optional[float], 2025-05-07T20:33:16.4273581Z contiguous: bool, 2025-05-07T20:33:16.4273854Z compiled: bool, 2025-05-07T20:33:16.4274097Z ) -> None: 2025-05-07T20:33:16.4274336Z torch.manual_seed(2025) 2025-05-07T20:33:16.4274601Z 2025-05-07T20:33:16.4274896Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.4275364Z 2025-05-07T20:33:16.4275580Z x_sign = torch.sign(x) 2025-05-07T20:33:16.4275895Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.4276238Z x = x_sign * x_clamp 2025-05-07T20:33:16.4276507Z x0 = x[:, :D] 2025-05-07T20:33:16.4276742Z x1 = x[:, D:] 2025-05-07T20:33:16.4276975Z 2025-05-07T20:33:16.4277183Z if contiguous: 2025-05-07T20:33:16.4277435Z x0 = x0.contiguous() 2025-05-07T20:33:16.4277720Z x1 = x1.contiguous() 2025-05-07T20:33:16.4277989Z 2025-05-07T20:33:16.4278197Z if scale_ub is not None: 2025-05-07T20:33:16.4278500Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.4278868Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.4279206Z ) 2025-05-07T20:33:16.4279414Z else: 2025-05-07T20:33:16.4279647Z scale_ub_tensor = None 2025-05-07T20:33:16.4279926Z 2025-05-07T20:33:16.4280178Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.4280523Z op = silu_mul_quant 2025-05-07T20:33:16.4280800Z if compiled: 2025-05-07T20:33:16.4281070Z op = torch.compile(op) 2025-05-07T20:33:16.4281399Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.4281702Z 2025-05-07T20:33:16.4281909Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.4282097Z 2025-05-07T20:33:16.4282206Z moe/activation_test.py:117: 2025-05-07T20:33:16.4282563Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.4282952Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.4283291Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.4283910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:16.4284525Z return fn(*args, **kwargs) 
2025-05-07T20:33:16.4285253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.4286011Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.4286607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.4287359Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.4288083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.4288673Z kernel = self.compile( 2025-05-07T20:33:16.4289273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.4290042Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.4290484Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.4290744Z 2025-05-07T20:33:16.4290975Z self = 2025-05-07T20:33:16.4292165Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.4293670Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd9074c700>} 2025-05-07T20:33:16.4295220Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.4296351Z context = 2025-05-07T20:33:16.4296668Z 2025-05-07T20:33:16.4296857Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.4297570Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.4298087Z module_map=module_map) 2025-05-07T20:33:16.4298491Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.4298882Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.4299165Z E ^ 2025-05-07T20:33:16.4299678Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.4300179Z 2025-05-07T20:33:16.4300639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.4301201Z 2025-05-07T20:33:16.5677149Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.5677658Z self=, 2025-05-07T20:33:16.5678118Z T=2048, 2025-05-07T20:33:16.5678337Z D=7168, 2025-05-07T20:33:16.5678553Z scale_ub=1200.0, 2025-05-07T20:33:16.5678803Z contiguous=False, 2025-05-07T20:33:16.5679061Z compiled=False, 2025-05-07T20:33:16.5679290Z ) 2025-05-07T20:33:16.5679645Z self = 2025-05-07T20:33:16.5680200Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:16.5680508Z 2025-05-07T20:33:16.5680600Z @given( 2025-05-07T20:33:16.5680855Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.5681207Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.5681555Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.5681922Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.5682292Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.5682614Z ) 2025-05-07T20:33:16.5683006Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.5683497Z def test_silu_mul_quant( 2025-05-07T20:33:16.5683771Z self, 2025-05-07T20:33:16.5683993Z T: int, 2025-05-07T20:33:16.5684224Z D: int, 2025-05-07T20:33:16.5684468Z scale_ub: Optional[float], 2025-05-07T20:33:16.5684779Z contiguous: bool, 2025-05-07T20:33:16.5685055Z compiled: bool, 2025-05-07T20:33:16.5685301Z ) -> None: 2025-05-07T20:33:16.5685539Z torch.manual_seed(2025) 2025-05-07T20:33:16.5685813Z 2025-05-07T20:33:16.5686113Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.5686494Z 2025-05-07T20:33:16.5686716Z x_sign = torch.sign(x) 2025-05-07T20:33:16.5687034Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.5687586Z x = x_sign * x_clamp 2025-05-07T20:33:16.5694508Z x0 = x[:, :D] 2025-05-07T20:33:16.5694794Z x1 = x[:, D:] 2025-05-07T20:33:16.5695038Z 2025-05-07T20:33:16.5695252Z if contiguous: 2025-05-07T20:33:16.5695523Z x0 = x0.contiguous() 2025-05-07T20:33:16.5695825Z x1 = x1.contiguous() 2025-05-07T20:33:16.5696103Z 2025-05-07T20:33:16.5696321Z if scale_ub is not None: 2025-05-07T20:33:16.5696637Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.5697017Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.5697362Z ) 2025-05-07T20:33:16.5697586Z else: 2025-05-07T20:33:16.5697830Z scale_ub_tensor = None 2025-05-07T20:33:16.5698117Z 2025-05-07T20:33:16.5698489Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.5698845Z op = silu_mul_quant 2025-05-07T20:33:16.5699133Z if compiled: 2025-05-07T20:33:16.5699451Z op = torch.compile(op) 2025-05-07T20:33:16.5699785Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.5700096Z 2025-05-07T20:33:16.5700316Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.5700580Z 2025-05-07T20:33:16.5700701Z moe/activation_test.py:117: 2025-05-07T20:33:16.5701096Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.5701475Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.5701789Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.5702563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:16.5703334Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.5703937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.5704849Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.5705598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.5706197Z kernel = self.compile( 2025-05-07T20:33:16.5706808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.5707538Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.5707979Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.5708233Z 2025-05-07T20:33:16.5708469Z self = 2025-05-07T20:33:16.5709661Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.5711196Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd9074d240>} 2025-05-07T20:33:16.5712695Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.5713909Z context = 2025-05-07T20:33:16.5714231Z 2025-05-07T20:33:16.5714422Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.5714999Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.5715523Z module_map=module_map) 2025-05-07T20:33:16.5715933Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.5716319Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.5716616Z E ^ 2025-05-07T20:33:16.5717201Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.5717703Z 2025-05-07T20:33:16.5718174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.5718743Z 2025-05-07T20:33:16.5718858Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.5719318Z self=, 2025-05-07T20:33:16.5719766Z T=1, 2025-05-07T20:33:16.5719972Z D=7168, 2025-05-07T20:33:16.5720189Z scale_ub=None, 2025-05-07T20:33:16.5720428Z contiguous=True, 2025-05-07T20:33:16.5720675Z compiled=False, 2025-05-07T20:33:16.5720905Z ) 2025-05-07T20:33:16.5721312Z self = 2025-05-07T20:33:16.5721853Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:16.5722143Z 2025-05-07T20:33:16.5722234Z @given( 2025-05-07T20:33:16.5722496Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.5722896Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.5723235Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.5723697Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.5724402Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.5724723Z ) 2025-05-07T20:33:16.5725115Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.5725606Z def test_silu_mul_quant( 2025-05-07T20:33:16.5725879Z self, 2025-05-07T20:33:16.5726093Z T: int, 2025-05-07T20:33:16.5726316Z D: int, 2025-05-07T20:33:16.5726563Z scale_ub: Optional[float], 2025-05-07T20:33:16.5726865Z contiguous: bool, 2025-05-07T20:33:16.5727133Z compiled: bool, 2025-05-07T20:33:16.5727384Z ) -> None: 2025-05-07T20:33:16.5727625Z torch.manual_seed(2025) 2025-05-07T20:33:16.5727895Z 2025-05-07T20:33:16.5728200Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.5728577Z 2025-05-07T20:33:16.5728804Z x_sign = torch.sign(x) 2025-05-07T20:33:16.5729139Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.5729485Z x = x_sign * x_clamp 2025-05-07T20:33:16.5729756Z x0 = x[:, :D] 2025-05-07T20:33:16.5730002Z x1 = x[:, D:] 2025-05-07T20:33:16.5730233Z 2025-05-07T20:33:16.5730444Z if contiguous: 2025-05-07T20:33:16.5730707Z x0 = x0.contiguous() 2025-05-07T20:33:16.5730992Z x1 = x1.contiguous() 2025-05-07T20:33:16.5731265Z 2025-05-07T20:33:16.5731480Z if scale_ub is not None: 2025-05-07T20:33:16.5731791Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.5732158Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.5732507Z ) 2025-05-07T20:33:16.5732730Z else: 2025-05-07T20:33:16.5732965Z scale_ub_tensor = None 2025-05-07T20:33:16.5733247Z 2025-05-07T20:33:16.5733509Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.5733858Z op = silu_mul_quant 2025-05-07T20:33:16.5734145Z if compiled: 2025-05-07T20:33:16.5734424Z op = torch.compile(op) 2025-05-07T20:33:16.5734753Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.5735060Z 2025-05-07T20:33:16.5735281Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.5735465Z 2025-05-07T20:33:16.5735577Z moe/activation_test.py:117: 2025-05-07T20:33:16.5735908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.5736278Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.5736596Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.5737446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.5738220Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.5738825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.5739587Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.5740327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.5740925Z kernel = self.compile( 2025-05-07T20:33:16.5741529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.5742252Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.5742805Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.5743065Z 2025-05-07T20:33:16.5743299Z self = 2025-05-07T20:33:16.5744505Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.5746153Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd9074e050>} 2025-05-07T20:33:16.5747644Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.5748774Z context = 2025-05-07T20:33:16.5749097Z 2025-05-07T20:33:16.5749289Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.5749864Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.5750386Z module_map=module_map) 2025-05-07T20:33:16.5750794Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.5751195Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.5751486Z E ^ 2025-05-07T20:33:16.5752005Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.5752507Z 2025-05-07T20:33:16.5752975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.5753616Z 2025-05-07T20:33:16.5753742Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.5754197Z self=, 2025-05-07T20:33:16.5754649Z T=16384, 2025-05-07T20:33:16.5754869Z D=7168, 2025-05-07T20:33:16.5755089Z scale_ub=1200.0, 2025-05-07T20:33:16.5755343Z contiguous=False, 2025-05-07T20:33:16.5755596Z compiled=True, 2025-05-07T20:33:16.8496288Z ) 2025-05-07T20:33:16.8496786Z self = 2025-05-07T20:33:16.8497358Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:16.8497686Z 2025-05-07T20:33:16.8497802Z @given( 2025-05-07T20:33:16.8498160Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.8498630Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.8499042Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.8499399Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.8499743Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.8500049Z ) 2025-05-07T20:33:16.8500459Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.8500931Z def test_silu_mul_quant( 2025-05-07T20:33:16.8501193Z self, 2025-05-07T20:33:16.8501545Z T: int, 2025-05-07T20:33:16.8501768Z D: int, 2025-05-07T20:33:16.8502005Z scale_ub: Optional[float], 2025-05-07T20:33:16.8502289Z contiguous: bool, 2025-05-07T20:33:16.8502549Z compiled: bool, 2025-05-07T20:33:16.8502794Z ) -> None: 2025-05-07T20:33:16.8503022Z torch.manual_seed(2025) 2025-05-07T20:33:16.8503285Z 2025-05-07T20:33:16.8503579Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.8503947Z 2025-05-07T20:33:16.8504151Z x_sign = torch.sign(x) 2025-05-07T20:33:16.8504463Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.8504794Z x = x_sign * x_clamp 2025-05-07T20:33:16.8505046Z x0 = x[:, :D] 2025-05-07T20:33:16.8505357Z x1 = x[:, D:] 2025-05-07T20:33:16.8505584Z 2025-05-07T20:33:16.8505778Z if contiguous: 2025-05-07T20:33:16.8506027Z x0 = x0.contiguous() 2025-05-07T20:33:16.8506309Z x1 = x1.contiguous() 2025-05-07T20:33:16.8506561Z 2025-05-07T20:33:16.8506782Z if scale_ub is not None: 2025-05-07T20:33:16.8507071Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.8507514Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.8507949Z ) 2025-05-07T20:33:16.8508156Z else: 2025-05-07T20:33:16.8508389Z scale_ub_tensor = None 2025-05-07T20:33:16.8508661Z 2025-05-07T20:33:16.8508904Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.8509239Z op = silu_mul_quant 2025-05-07T20:33:16.8509507Z if compiled: 2025-05-07T20:33:16.8509768Z op = torch.compile(op) 2025-05-07T20:33:16.8510088Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.8510391Z 2025-05-07T20:33:16.8510597Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.8510779Z 2025-05-07T20:33:16.8510884Z moe/activation_test.py:117: 2025-05-07T20:33:16.8511213Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.8511564Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.8511863Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.8512468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:16.8513113Z return fn(*args, **kwargs) 
2025-05-07T20:33:16.8513942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.8514678Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.8515248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.8515984Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.8516683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.8517255Z kernel = self.compile( 2025-05-07T20:33:16.8517833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.8518530Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.8518956Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.8519203Z 2025-05-07T20:33:16.8519427Z self = 2025-05-07T20:33:16.8520570Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.8522094Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd9074f490>} 2025-05-07T20:33:16.8523512Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.8524932Z context = 2025-05-07T20:33:16.8525246Z 2025-05-07T20:33:16.8525422Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.8525975Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.8526467Z module_map=module_map) 2025-05-07T20:33:16.8526857Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.8527232Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.8527591Z E ^ 2025-05-07T20:33:16.8528090Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.8528570Z 2025-05-07T20:33:16.8529007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.8529545Z 2025-05-07T20:33:16.8529728Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.8530213Z self=, 2025-05-07T20:33:16.8530644Z T=1, 2025-05-07T20:33:16.8530847Z D=7168, 2025-05-07T20:33:16.8531049Z scale_ub=None, 2025-05-07T20:33:16.8531283Z contiguous=False, 2025-05-07T20:33:16.8531530Z compiled=False, 2025-05-07T20:33:16.8531747Z ) 2025-05-07T20:33:16.8532088Z self = 2025-05-07T20:33:16.8532612Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:16.8532892Z 2025-05-07T20:33:16.8532984Z @given( 2025-05-07T20:33:16.8533227Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.8533565Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.8533894Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.8534242Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.8534601Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.8534911Z ) 2025-05-07T20:33:16.8535281Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.8535752Z def test_silu_mul_quant( 2025-05-07T20:33:16.8536014Z self, 2025-05-07T20:33:16.8536233Z T: int, 2025-05-07T20:33:16.8536443Z D: int, 2025-05-07T20:33:16.8536678Z scale_ub: Optional[float], 2025-05-07T20:33:16.8536966Z contiguous: bool, 2025-05-07T20:33:16.8537218Z compiled: bool, 2025-05-07T20:33:16.8537460Z ) -> None: 2025-05-07T20:33:16.8537688Z torch.manual_seed(2025) 2025-05-07T20:33:16.8537941Z 2025-05-07T20:33:16.8538238Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.8538601Z 2025-05-07T20:33:16.8538805Z x_sign = torch.sign(x) 2025-05-07T20:33:16.8539115Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.8539450Z x = x_sign * x_clamp 2025-05-07T20:33:16.8539702Z x0 = x[:, :D] 2025-05-07T20:33:16.8539943Z x1 = x[:, D:] 2025-05-07T20:33:16.8540169Z 2025-05-07T20:33:16.8540363Z if contiguous: 2025-05-07T20:33:16.8540615Z x0 = x0.contiguous() 2025-05-07T20:33:16.8540890Z x1 = x1.contiguous() 2025-05-07T20:33:16.8541144Z 2025-05-07T20:33:16.8541352Z if scale_ub is not None: 2025-05-07T20:33:16.8541643Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.8542000Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.8542331Z ) 2025-05-07T20:33:16.8542540Z else: 2025-05-07T20:33:16.8542763Z scale_ub_tensor = None 2025-05-07T20:33:16.8543026Z 2025-05-07T20:33:16.8543345Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.8543683Z op = silu_mul_quant 2025-05-07T20:33:16.8543946Z if compiled: 2025-05-07T20:33:16.8544213Z op = torch.compile(op) 2025-05-07T20:33:16.8544535Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.8544822Z 2025-05-07T20:33:16.8545031Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.8545206Z 2025-05-07T20:33:16.8545320Z moe/activation_test.py:117: 2025-05-07T20:33:16.8545633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.8545987Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.8546287Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.8547025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.8547945Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.8548722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.8549447Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.8550263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.8550828Z kernel = self.compile( 2025-05-07T20:33:16.8551411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.8552109Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.8552528Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.8552787Z 2025-05-07T20:33:16.8553010Z self = 2025-05-07T20:33:16.8554227Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.8555695Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd9074f7f0>} 2025-05-07T20:33:16.8557138Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.8558212Z context = 2025-05-07T20:33:16.8558639Z 2025-05-07T20:33:16.8558898Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.8559562Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.8560067Z module_map=module_map) 2025-05-07T20:33:16.8560449Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.8560820Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.8561098Z E ^ 2025-05-07T20:33:16.8561586Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.8562068Z 2025-05-07T20:33:16.8562504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.8563047Z 2025-05-07T20:33:16.8563157Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.8563598Z self=, 2025-05-07T20:33:16.8564018Z T=2048, 2025-05-07T20:33:16.8564222Z D=7168, 2025-05-07T20:33:16.8564432Z scale_ub=None, 2025-05-07T20:33:16.8564658Z contiguous=False, 2025-05-07T20:33:16.8564898Z compiled=True, 2025-05-07T20:33:16.8565117Z ) 2025-05-07T20:33:16.9567399Z self = 2025-05-07T20:33:16.9568037Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:16.9568328Z 2025-05-07T20:33:16.9568421Z @given( 2025-05-07T20:33:16.9568675Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.9569011Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.9569337Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.9569694Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.9570049Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.9570355Z ) 2025-05-07T20:33:16.9570725Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.9571273Z def test_silu_mul_quant( 2025-05-07T20:33:16.9571536Z self, 2025-05-07T20:33:16.9571743Z T: int, 2025-05-07T20:33:16.9571959Z D: int, 2025-05-07T20:33:16.9572202Z scale_ub: Optional[float], 2025-05-07T20:33:16.9572488Z contiguous: bool, 2025-05-07T20:33:16.9572750Z compiled: bool, 2025-05-07T20:33:16.9572997Z ) -> None: 2025-05-07T20:33:16.9573226Z torch.manual_seed(2025) 2025-05-07T20:33:16.9573561Z 2025-05-07T20:33:16.9573923Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.9574287Z 2025-05-07T20:33:16.9574503Z x_sign = torch.sign(x) 2025-05-07T20:33:16.9574822Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.9575149Z x = x_sign * x_clamp 2025-05-07T20:33:16.9575412Z x0 = x[:, :D] 2025-05-07T20:33:16.9575660Z x1 = x[:, D:] 2025-05-07T20:33:16.9575885Z 2025-05-07T20:33:16.9576086Z if contiguous: 2025-05-07T20:33:16.9576346Z x0 = x0.contiguous() 2025-05-07T20:33:16.9576623Z x1 = x1.contiguous() 2025-05-07T20:33:16.9576878Z 2025-05-07T20:33:16.9577091Z if scale_ub is not None: 2025-05-07T20:33:16.9577388Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.9577900Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.9578388Z ) 2025-05-07T20:33:16.9578681Z else: 2025-05-07T20:33:16.9578995Z scale_ub_tensor = None 2025-05-07T20:33:16.9579353Z 2025-05-07T20:33:16.9579606Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.9579933Z op = silu_mul_quant 2025-05-07T20:33:16.9580199Z if compiled: 2025-05-07T20:33:16.9580466Z op = torch.compile(op) 2025-05-07T20:33:16.9580779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.9581074Z 2025-05-07T20:33:16.9581284Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.9581460Z 2025-05-07T20:33:16.9581575Z moe/activation_test.py:117: 2025-05-07T20:33:16.9581884Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.9582237Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.9582543Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.9583183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:16.9583781Z return fn(*args, **kwargs) 
2025-05-07T20:33:16.9584481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.9585209Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.9585770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.9586491Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.9587192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.9587751Z kernel = self.compile( 2025-05-07T20:33:16.9588401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.9589103Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.9589531Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.9589776Z 2025-05-07T20:33:16.9590000Z self = 2025-05-07T20:33:16.9591141Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.9592598Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbf1caf0>} 2025-05-07T20:33:16.9594167Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.9595233Z context = 2025-05-07T20:33:16.9595616Z 2025-05-07T20:33:16.9595833Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.9596390Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.9596886Z module_map=module_map) 2025-05-07T20:33:16.9604321Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.9604715Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.9604998Z E ^ 2025-05-07T20:33:16.9605490Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.9605974Z 2025-05-07T20:33:16.9606415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.9606958Z 2025-05-07T20:33:16.9607069Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.9607508Z self=, 2025-05-07T20:33:16.9607925Z T=4096, 2025-05-07T20:33:16.9608129Z D=7168, 2025-05-07T20:33:16.9608339Z scale_ub=None, 2025-05-07T20:33:16.9608564Z contiguous=False, 2025-05-07T20:33:16.9608812Z compiled=True, 2025-05-07T20:33:16.9609029Z ) 2025-05-07T20:33:16.9609361Z self = 2025-05-07T20:33:16.9609880Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:16.9610169Z 2025-05-07T20:33:16.9610252Z @given( 2025-05-07T20:33:16.9610502Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.9610830Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.9611159Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.9611510Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.9611855Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.9612161Z ) 2025-05-07T20:33:16.9612536Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.9613000Z def test_silu_mul_quant( 2025-05-07T20:33:16.9613259Z self, 2025-05-07T20:33:16.9613469Z T: int, 2025-05-07T20:33:16.9613675Z D: int, 2025-05-07T20:33:16.9613908Z scale_ub: Optional[float], 2025-05-07T20:33:16.9614199Z contiguous: bool, 2025-05-07T20:33:16.9614457Z compiled: bool, 2025-05-07T20:33:16.9614692Z ) -> None: 2025-05-07T20:33:16.9614923Z torch.manual_seed(2025) 2025-05-07T20:33:16.9615180Z 2025-05-07T20:33:16.9615467Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.9615831Z 2025-05-07T20:33:16.9616043Z x_sign = torch.sign(x) 2025-05-07T20:33:16.9616428Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.9616760Z x = x_sign * x_clamp 2025-05-07T20:33:16.9617022Z x0 = x[:, :D] 2025-05-07T20:33:16.9617251Z x1 = x[:, D:] 2025-05-07T20:33:16.9617481Z 2025-05-07T20:33:16.9617685Z if contiguous: 2025-05-07T20:33:16.9617932Z x0 = x0.contiguous() 2025-05-07T20:33:16.9618210Z x1 = x1.contiguous() 2025-05-07T20:33:16.9618468Z 2025-05-07T20:33:16.9618668Z if scale_ub is not None: 2025-05-07T20:33:16.9618959Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.9619316Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.9619638Z ) 2025-05-07T20:33:16.9619846Z else: 2025-05-07T20:33:16.9620120Z scale_ub_tensor = None 2025-05-07T20:33:16.9620388Z 2025-05-07T20:33:16.9620630Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.9620963Z op = silu_mul_quant 2025-05-07T20:33:16.9621231Z if compiled: 2025-05-07T20:33:16.9621493Z op = torch.compile(op) 2025-05-07T20:33:16.9621811Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.9622147Z 2025-05-07T20:33:16.9622349Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.9622569Z 2025-05-07T20:33:16.9622675Z moe/activation_test.py:117: 2025-05-07T20:33:16.9622992Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.9623335Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.9623634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.9624595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:16.9625183Z return fn(*args, **kwargs) 
2025-05-07T20:33:16.9625873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.9626602Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.9627169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.9627876Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.9628574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.9629131Z kernel = self.compile( 2025-05-07T20:33:16.9629699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.9630380Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.9630798Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.9631040Z 2025-05-07T20:33:16.9631266Z self = 2025-05-07T20:33:16.9632397Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.9633958Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbf1c280>} 2025-05-07T20:33:16.9635367Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.9636443Z context = 2025-05-07T20:33:16.9636748Z 2025-05-07T20:33:16.9636930Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.9637473Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.9638072Z module_map=module_map) 2025-05-07T20:33:16.9638462Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.9638837Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.9639111Z E ^ 2025-05-07T20:33:16.9639605Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.9640073Z 2025-05-07T20:33:16.9640516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.9641051Z 2025-05-07T20:33:17.3089239Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:17.3089790Z self=, 2025-05-07T20:33:17.3090561Z T=16384, 2025-05-07T20:33:17.3090849Z D=5120, 2025-05-07T20:33:17.3091145Z scale_ub=1200.0, 2025-05-07T20:33:17.3091439Z contiguous=False, 2025-05-07T20:33:17.3091683Z compiled=False, 2025-05-07T20:33:17.3091910Z ) 2025-05-07T20:33:17.3092253Z self = 2025-05-07T20:33:17.3092787Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:17.3093199Z 2025-05-07T20:33:17.3093318Z @given( 2025-05-07T20:33:17.3093621Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:17.3093958Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:17.3094291Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:17.3094638Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:17.3094992Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:17.3095302Z ) 2025-05-07T20:33:17.3095671Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:17.3096148Z def test_silu_mul_quant( 2025-05-07T20:33:17.3096408Z self, 2025-05-07T20:33:17.3096619Z T: int, 2025-05-07T20:33:17.3096831Z D: int, 2025-05-07T20:33:17.3097069Z scale_ub: Optional[float], 2025-05-07T20:33:17.3097363Z contiguous: bool, 2025-05-07T20:33:17.3097617Z compiled: bool, 2025-05-07T20:33:17.3097867Z ) -> None: 2025-05-07T20:33:17.3098103Z torch.manual_seed(2025) 2025-05-07T20:33:17.3098365Z 2025-05-07T20:33:17.3098659Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:17.3099026Z 2025-05-07T20:33:17.3099232Z x_sign = torch.sign(x) 2025-05-07T20:33:17.3099552Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:17.3099885Z x = x_sign * x_clamp 2025-05-07T20:33:17.3100141Z x0 = x[:, :D] 2025-05-07T20:33:17.3100377Z x1 = x[:, D:] 2025-05-07T20:33:17.3100603Z 2025-05-07T20:33:17.3100802Z if contiguous: 2025-05-07T20:33:17.3101054Z x0 = x0.contiguous() 2025-05-07T20:33:17.3101324Z x1 = x1.contiguous() 2025-05-07T20:33:17.3101576Z 2025-05-07T20:33:17.3101785Z if scale_ub is not None: 2025-05-07T20:33:17.3102076Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:17.3102426Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:17.3102776Z ) 2025-05-07T20:33:17.3103014Z else: 2025-05-07T20:33:17.3103268Z scale_ub_tensor = None 2025-05-07T20:33:17.3103535Z 2025-05-07T20:33:17.3103784Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:17.3104119Z op = silu_mul_quant 2025-05-07T20:33:17.3104384Z if compiled: 2025-05-07T20:33:17.3104657Z op = torch.compile(op) 2025-05-07T20:33:17.3104976Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.3105264Z 2025-05-07T20:33:17.3105474Z > y_fp8, y_scale = fn() 2025-05-07T20:33:17.3105651Z 2025-05-07T20:33:17.3105764Z moe/activation_test.py:117: 2025-05-07T20:33:17.3106210Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.3106566Z moe/activation_test.py:115: in fn 2025-05-07T20:33:17.3106876Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.3107613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:17.3108342Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:17.3108913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:17.3109639Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:17.3110338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:17.3110907Z kernel = self.compile( 2025-05-07T20:33:17.3111530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:17.3112225Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:17.3112642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.3112889Z 2025-05-07T20:33:17.3113108Z self = 2025-05-07T20:33:17.3114427Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:17.3115897Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbf1ed40>} 2025-05-07T20:33:17.3117323Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:17.3118410Z context = 2025-05-07T20:33:17.3118722Z 2025-05-07T20:33:17.3118899Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:17.3119456Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:17.3119948Z module_map=module_map) 2025-05-07T20:33:17.3120339Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:17.3120717Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:17.3120993Z E ^ 2025-05-07T20:33:17.3121483Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:17.3121961Z 2025-05-07T20:33:17.3122403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:17.3122946Z 2025-05-07T20:33:17.3123071Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:17.3123511Z self=, 2025-05-07T20:33:17.3124179Z T=16384, 2025-05-07T20:33:17.3124391Z D=5120, 2025-05-07T20:33:17.3124610Z scale_ub=1200.0, 2025-05-07T20:33:17.3124844Z contiguous=True, 2025-05-07T20:33:17.3125089Z compiled=True, 2025-05-07T20:33:17.3125312Z ) 2025-05-07T20:33:17.3125647Z self = 2025-05-07T20:33:17.3126172Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:17.3126469Z 2025-05-07T20:33:17.3126554Z @given( 2025-05-07T20:33:17.3126801Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:17.3127132Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:17.3127462Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:17.3127816Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:17.3128239Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:17.3128545Z ) 2025-05-07T20:33:17.3128919Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:17.3129384Z def test_silu_mul_quant( 2025-05-07T20:33:17.3129647Z self, 2025-05-07T20:33:17.3129859Z T: int, 2025-05-07T20:33:17.3130071Z D: int, 2025-05-07T20:33:17.3130298Z scale_ub: Optional[float], 2025-05-07T20:33:17.3130586Z contiguous: bool, 2025-05-07T20:33:17.3130840Z compiled: bool, 2025-05-07T20:33:17.3131073Z ) -> None: 2025-05-07T20:33:17.3131304Z torch.manual_seed(2025) 2025-05-07T20:33:17.3131561Z 2025-05-07T20:33:17.3131846Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:17.3132274Z 2025-05-07T20:33:17.3132482Z x_sign = torch.sign(x) 2025-05-07T20:33:17.3132785Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:17.3133113Z x = x_sign * x_clamp 2025-05-07T20:33:17.3133374Z x0 = x[:, :D] 2025-05-07T20:33:17.3133604Z x1 = x[:, D:] 2025-05-07T20:33:17.3133826Z 2025-05-07T20:33:17.3134027Z if contiguous: 2025-05-07T20:33:17.3134268Z x0 = x0.contiguous() 2025-05-07T20:33:17.3134613Z x1 = x1.contiguous() 2025-05-07T20:33:17.3134927Z 2025-05-07T20:33:17.3135133Z if scale_ub is not None: 2025-05-07T20:33:17.3135424Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:17.3135782Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:17.3136116Z ) 2025-05-07T20:33:17.3136327Z else: 2025-05-07T20:33:17.3136554Z scale_ub_tensor = None 2025-05-07T20:33:17.3136823Z 2025-05-07T20:33:17.3137066Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:17.3137403Z op = silu_mul_quant 2025-05-07T20:33:17.3137667Z if compiled: 2025-05-07T20:33:17.3137933Z op = torch.compile(op) 2025-05-07T20:33:17.3138251Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.3138545Z 2025-05-07T20:33:17.3138747Z > y_fp8, y_scale = fn() 2025-05-07T20:33:17.3138929Z 2025-05-07T20:33:17.3139037Z moe/activation_test.py:117: 2025-05-07T20:33:17.3139355Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.3139699Z moe/activation_test.py:115: in fn 2025-05-07T20:33:17.3139999Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.3140592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:17.3141183Z return fn(*args, **kwargs) 
2025-05-07T20:33:17.3141880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:17.3142612Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:17.3143186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:17.3143904Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:17.3144598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:17.3145160Z kernel = self.compile( 2025-05-07T20:33:17.3145729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:17.3146411Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:17.3146827Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.3147072Z 2025-05-07T20:33:17.3147289Z self = 2025-05-07T20:33:17.3148474Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:17.3149906Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbf1e830>} 2025-05-07T20:33:17.3151313Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:17.3152382Z context = 2025-05-07T20:33:17.3152706Z 2025-05-07T20:33:17.3152917Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:17.3153467Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:17.3154074Z module_map=module_map) 2025-05-07T20:33:17.3154462Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:17.3154834Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:17.3155105Z E ^ 2025-05-07T20:33:17.3155597Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:17.3156142Z 2025-05-07T20:33:17.3156619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:17.3157263Z 2025-05-07T20:33:17.5057240Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:17.5058481Z self=, 2025-05-07T20:33:17.5059566Z T=16384, 2025-05-07T20:33:17.5059959Z D=5120, 2025-05-07T20:33:17.5060337Z scale_ub=None, 2025-05-07T20:33:17.5060767Z contiguous=False, 2025-05-07T20:33:17.5061206Z compiled=True, 2025-05-07T20:33:17.5061607Z ) 2025-05-07T20:33:17.5062222Z self = 2025-05-07T20:33:17.5063174Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:17.5063523Z 2025-05-07T20:33:17.5063610Z @given( 2025-05-07T20:33:17.5063862Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:17.5064201Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:17.5064531Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:17.5064887Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:17.5065233Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:17.5065542Z ) 2025-05-07T20:33:17.5065920Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:17.5066386Z def test_silu_mul_quant( 2025-05-07T20:33:17.5066649Z self, 2025-05-07T20:33:17.5066863Z T: int, 2025-05-07T20:33:17.5067076Z D: int, 2025-05-07T20:33:17.5067315Z scale_ub: Optional[float], 2025-05-07T20:33:17.5067610Z contiguous: bool, 2025-05-07T20:33:17.5067866Z compiled: bool, 2025-05-07T20:33:17.5068107Z ) -> None: 2025-05-07T20:33:17.5068344Z torch.manual_seed(2025) 2025-05-07T20:33:17.5068611Z 2025-05-07T20:33:17.5068897Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:17.5069266Z 2025-05-07T20:33:17.5069476Z x_sign = torch.sign(x) 2025-05-07T20:33:17.5069781Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:17.5070110Z x = x_sign * x_clamp 2025-05-07T20:33:17.5070368Z x0 = x[:, :D] 2025-05-07T20:33:17.5070593Z x1 = x[:, D:] 2025-05-07T20:33:17.5070816Z 2025-05-07T20:33:17.5071016Z if contiguous: 2025-05-07T20:33:17.5071260Z x0 = x0.contiguous() 2025-05-07T20:33:17.5071539Z x1 = x1.contiguous() 2025-05-07T20:33:17.5071795Z 2025-05-07T20:33:17.5071998Z if scale_ub is not None: 2025-05-07T20:33:17.5072432Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:17.5072798Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:17.5073120Z ) 2025-05-07T20:33:17.5073331Z else: 2025-05-07T20:33:17.5073642Z scale_ub_tensor = None 2025-05-07T20:33:17.5073917Z 2025-05-07T20:33:17.5074165Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:17.5074502Z op = silu_mul_quant 2025-05-07T20:33:17.5074770Z if compiled: 2025-05-07T20:33:17.5075029Z op = torch.compile(op) 2025-05-07T20:33:17.5075343Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.5075637Z 2025-05-07T20:33:17.5075839Z > y_fp8, y_scale = fn() 2025-05-07T20:33:17.5076024Z 2025-05-07T20:33:17.5076131Z moe/activation_test.py:117: 2025-05-07T20:33:17.5076524Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.5076870Z moe/activation_test.py:115: in fn 2025-05-07T20:33:17.5077178Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.5077771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:17.5078369Z return fn(*args, **kwargs) 
2025-05-07T20:33:17.5079207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:17.5079940Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:17.5080510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:17.5081233Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:17.5081936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:17.5082508Z kernel = self.compile( 2025-05-07T20:33:17.5083117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:17.5083800Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:17.5084220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.5084462Z 2025-05-07T20:33:17.5084690Z self = 2025-05-07T20:33:17.5085825Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:17.5087276Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbf1f760>} 2025-05-07T20:33:17.5088692Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:17.5089766Z context = 2025-05-07T20:33:17.5090070Z 2025-05-07T20:33:17.5090254Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:17.5090800Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:17.5091290Z module_map=module_map) 2025-05-07T20:33:17.5091676Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:17.5092046Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:17.5092315Z E ^ 2025-05-07T20:33:17.5092809Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:17.5093281Z 2025-05-07T20:33:17.5093720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
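Every one of these compilation failures has the same root cause: the kernel asks Triton for the fp8e4nv (FP8 E4M3) dtype, which Triton only provides on NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper), while the A10G on this linux.g5.4xlarge runner is SM 8.6, where only fp8e4b15 and fp8e5 are available, exactly as the error message lists. A minimal sketch of a capability guard that would skip these examples on unsupported hardware; the helper and marker names are hypothetical, not part of the test suite:

    # Minimal sketch, assuming pytest: skip fp8 (E4M3) Triton tests on GPUs older
    # than SM 8.9, where Triton raises "type fp8e4nv not supported in this architecture".
    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton's fp8e4nv lowering needs compute capability >= 8.9.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    requires_fp8 = pytest.mark.skipif(
        not supports_fp8e4nv(),
        reason="fp8e4nv requires an SM 8.9+ GPU; this device only supports fp8e4b15/fp8e5",
    )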
[The next ten Hypothesis examples fail identically; the repeated test source and traceback are elided, keeping one line per example.]
2025-05-07T20:33:17.5094421Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:17.6177762Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:17.9870410Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:17.9905845Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:18.1095882Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:18.2499733Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:18.2535641Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:18.4513006Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:18.4548232Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:18.5632786Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
[each fails with: E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:33:18.6466278Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
[same test source; this example fails earlier, while preparing its inputs]
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
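The allocation size in this OOM message lines up exactly with the test's own shapes, which points at capacity rather than a miscomputed launch: torch.abs(x) materializes a temporary the same size as x, a [T, 2*D] bfloat16 tensor. A quick back-of-the-envelope check (plain arithmetic, not taken from the log):

    # For the failing example above (T=16384, D=5120), one [T, 2*D] bfloat16
    # temporary is exactly the 320.00 MiB the allocator reports:
    T, D = 16384, 5120
    size_bytes = T * (2 * D) * 2      # bfloat16 is 2 bytes/element -> 335_544_320
    print(size_bytes / 2**20)         # 320.0 (MiB); T=4096, D=7168 gives 112.0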
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.6480568Z 2025-05-07T20:33:18.6480698Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:18.6480922Z 2025-05-07T20:33:18.6481038Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.6481471Z self=, 2025-05-07T20:33:18.6481895Z T=4096, 2025-05-07T20:33:18.6482096Z D=7168, 2025-05-07T20:33:18.6482294Z scale_ub=1200.0, 2025-05-07T20:33:18.6482615Z contiguous=True, 2025-05-07T20:33:18.6482852Z compiled=True, 2025-05-07T20:33:18.6483065Z ) 2025-05-07T20:33:18.6483448Z self = 2025-05-07T20:33:18.6483965Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:18.6484245Z 2025-05-07T20:33:18.6484328Z @given( 2025-05-07T20:33:18.6484570Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.6484897Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.6485220Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.6485561Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.6485908Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.6486272Z ) 2025-05-07T20:33:18.6486639Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.6487101Z def test_silu_mul_quant( 2025-05-07T20:33:18.6487361Z self, 2025-05-07T20:33:18.6487562Z T: int, 2025-05-07T20:33:18.6487770Z D: int, 2025-05-07T20:33:18.6487998Z scale_ub: Optional[float], 2025-05-07T20:33:18.6488280Z contiguous: bool, 2025-05-07T20:33:18.6488635Z compiled: bool, 2025-05-07T20:33:18.6488909Z ) -> None: 2025-05-07T20:33:18.6489133Z torch.manual_seed(2025) 2025-05-07T20:33:18.6489389Z 2025-05-07T20:33:18.6489675Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.6490034Z 2025-05-07T20:33:18.6490238Z x_sign = torch.sign(x) 2025-05-07T20:33:18.6490541Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.6492636Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.6494583Z 2025-05-07T20:33:18.6494714Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:18.6494936Z 2025-05-07T20:33:18.6495045Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.6495480Z self=, 2025-05-07T20:33:18.6495898Z T=16384, 2025-05-07T20:33:18.6496103Z D=7168, 2025-05-07T20:33:18.6496302Z scale_ub=None, 2025-05-07T20:33:18.6496530Z contiguous=False, 2025-05-07T20:33:18.6496766Z compiled=False, 2025-05-07T20:33:18.6496979Z ) 2025-05-07T20:33:18.6497310Z self = 2025-05-07T20:33:18.6497831Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:18.6498123Z 2025-05-07T20:33:18.6498205Z @given( 2025-05-07T20:33:18.6498446Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.6498774Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.6499095Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.6499440Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.6499786Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.6500086Z ) 2025-05-07T20:33:18.6500448Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.6500908Z def test_silu_mul_quant( 2025-05-07T20:33:18.6501162Z self, 2025-05-07T20:33:18.6501363Z T: int, 2025-05-07T20:33:18.6501569Z D: int, 2025-05-07T20:33:18.6501800Z scale_ub: Optional[float], 2025-05-07T20:33:18.6502081Z contiguous: bool, 2025-05-07T20:33:18.6502336Z compiled: bool, 2025-05-07T20:33:18.6502619Z ) -> None: 2025-05-07T20:33:18.6502844Z torch.manual_seed(2025) 2025-05-07T20:33:18.6503099Z 2025-05-07T20:33:18.6503381Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.6505538Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.6507525Z 2025-05-07T20:33:18.6507656Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.6507879Z 2025-05-07T20:33:18.6507990Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.6508422Z self=, 2025-05-07T20:33:18.6508843Z T=2048, 2025-05-07T20:33:18.6509037Z D=7168, 2025-05-07T20:33:18.6509286Z scale_ub=1200.0, 2025-05-07T20:33:18.6509521Z contiguous=True, 2025-05-07T20:33:18.6509814Z compiled=True, 2025-05-07T20:33:18.6510028Z ) 2025-05-07T20:33:18.6510356Z self = 2025-05-07T20:33:18.6510865Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:18.6511154Z 2025-05-07T20:33:18.6511235Z @given( 2025-05-07T20:33:18.6511474Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.6511800Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.6512120Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.6512463Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.6512813Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.6513110Z ) 2025-05-07T20:33:18.6513477Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.6513982Z def test_silu_mul_quant( 2025-05-07T20:33:18.6514231Z self, 2025-05-07T20:33:18.6514444Z T: int, 2025-05-07T20:33:18.6514653Z D: int, 2025-05-07T20:33:18.6514877Z scale_ub: Optional[float], 2025-05-07T20:33:18.6515160Z contiguous: bool, 2025-05-07T20:33:18.6515414Z compiled: bool, 2025-05-07T20:33:18.6515643Z ) -> None: 2025-05-07T20:33:18.6515872Z torch.manual_seed(2025) 2025-05-07T20:33:18.6516125Z 2025-05-07T20:33:18.6516404Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.6516767Z 2025-05-07T20:33:18.6516977Z x_sign = torch.sign(x) 2025-05-07T20:33:18.6517277Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.6519361Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.6521298Z 2025-05-07T20:33:18.6521421Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:18.6521649Z 2025-05-07T20:33:18.6521757Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.6522186Z self=, 2025-05-07T20:33:18.6522601Z T=2048, 2025-05-07T20:33:18.6522800Z D=7168, 2025-05-07T20:33:18.6523016Z scale_ub=None, 2025-05-07T20:33:18.6523273Z contiguous=True, 2025-05-07T20:33:18.6523556Z compiled=False, 2025-05-07T20:33:18.6523952Z ) 2025-05-07T20:33:18.9573447Z self = 2025-05-07T20:33:18.9574473Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:18.9574998Z 2025-05-07T20:33:18.9575159Z @given( 2025-05-07T20:33:18.9575435Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.9575776Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.9576109Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.9576469Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.9576833Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.9577143Z ) 2025-05-07T20:33:18.9577657Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.9578133Z def test_silu_mul_quant( 2025-05-07T20:33:18.9578400Z self, 2025-05-07T20:33:18.9578628Z T: int, 2025-05-07T20:33:18.9578841Z D: int, 2025-05-07T20:33:18.9579081Z scale_ub: Optional[float], 2025-05-07T20:33:18.9579378Z contiguous: bool, 2025-05-07T20:33:18.9579635Z compiled: bool, 2025-05-07T20:33:18.9579955Z ) -> None: 2025-05-07T20:33:18.9580257Z torch.manual_seed(2025) 2025-05-07T20:33:18.9580517Z 2025-05-07T20:33:18.9580816Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.9581189Z 2025-05-07T20:33:18.9581398Z > x_sign = torch.sign(x) 2025-05-07T20:33:18.9583545Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.9585543Z 2025-05-07T20:33:18.9585673Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:18.9585912Z 2025-05-07T20:33:18.9586027Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.9586474Z self=, 2025-05-07T20:33:18.9586904Z T=1, 2025-05-07T20:33:18.9587110Z D=7168, 2025-05-07T20:33:18.9587320Z scale_ub=1200.0, 2025-05-07T20:33:18.9587560Z contiguous=True, 2025-05-07T20:33:18.9587802Z compiled=False, 2025-05-07T20:33:18.9588024Z ) 2025-05-07T20:33:18.9588369Z self = 2025-05-07T20:33:18.9588894Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:18.9589185Z 2025-05-07T20:33:18.9589270Z @given( 2025-05-07T20:33:18.9589521Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.9589855Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.9590185Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.9590547Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.9590902Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.9591212Z ) 2025-05-07T20:33:18.9591592Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.9592067Z def test_silu_mul_quant( 2025-05-07T20:33:18.9592332Z self, 2025-05-07T20:33:18.9592545Z T: int, 2025-05-07T20:33:18.9592754Z D: int, 2025-05-07T20:33:18.9592998Z scale_ub: Optional[float], 2025-05-07T20:33:18.9593296Z contiguous: bool, 2025-05-07T20:33:18.9593630Z compiled: bool, 2025-05-07T20:33:18.9593871Z ) -> None: 2025-05-07T20:33:18.9594108Z torch.manual_seed(2025) 2025-05-07T20:33:18.9594368Z 2025-05-07T20:33:18.9594732Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.9595103Z 2025-05-07T20:33:18.9595316Z x_sign = torch.sign(x) 2025-05-07T20:33:18.9595626Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.9595969Z x = x_sign * x_clamp 2025-05-07T20:33:18.9596232Z x0 = x[:, :D] 2025-05-07T20:33:18.9596464Z x1 = x[:, D:] 2025-05-07T20:33:18.9596694Z 2025-05-07T20:33:18.9596897Z if contiguous: 2025-05-07T20:33:18.9597147Z x0 = x0.contiguous() 2025-05-07T20:33:18.9597428Z x1 = x1.contiguous() 2025-05-07T20:33:18.9597691Z 2025-05-07T20:33:18.9597895Z if scale_ub is not None: 2025-05-07T20:33:18.9598196Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:18.9598609Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:18.9598947Z ) 2025-05-07T20:33:18.9599155Z else: 2025-05-07T20:33:18.9599391Z scale_ub_tensor = None 2025-05-07T20:33:18.9599665Z 2025-05-07T20:33:18.9599914Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.9600249Z op = silu_mul_quant 2025-05-07T20:33:18.9600566Z if compiled: 2025-05-07T20:33:18.9600873Z op = torch.compile(op) 2025-05-07T20:33:18.9601197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.9601498Z 2025-05-07T20:33:18.9601704Z > y_fp8, y_scale = fn() 2025-05-07T20:33:18.9601888Z 2025-05-07T20:33:18.9601996Z moe/activation_test.py:117: 2025-05-07T20:33:18.9602318Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.9602672Z moe/activation_test.py:115: in fn 2025-05-07T20:33:18.9602983Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.9603790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:18.9604540Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:18.9605117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:18.9605855Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:18.9606580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:18.9607160Z kernel = self.compile( 2025-05-07T20:33:18.9607743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:18.9608452Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:18.9608880Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.9609129Z 2025-05-07T20:33:18.9609354Z self = 2025-05-07T20:33:18.9610522Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:18.9612010Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacb6d44c0>} 2025-05-07T20:33:18.9613457Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:18.9614559Z context = 2025-05-07T20:33:18.9614871Z 2025-05-07T20:33:18.9615055Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:18.9615624Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:18.9616178Z module_map=module_map) 2025-05-07T20:33:18.9616569Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:18.9616952Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:18.9617239Z E ^ 2025-05-07T20:33:18.9617744Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.9618227Z 2025-05-07T20:33:18.9618673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.9619225Z 2025-05-07T20:33:18.9619339Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.9619788Z self=, 2025-05-07T20:33:18.9620271Z T=128, 2025-05-07T20:33:18.9620472Z D=5120, 2025-05-07T20:33:18.9620685Z scale_ub=None, 2025-05-07T20:33:18.9620917Z contiguous=True, 2025-05-07T20:33:18.9621160Z compiled=False, 2025-05-07T20:33:18.9621384Z ) 2025-05-07T20:33:19.0430078Z self = 2025-05-07T20:33:19.0430823Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.0431386Z 2025-05-07T20:33:19.0431580Z @given( 2025-05-07T20:33:19.0431972Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.0432457Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.0432918Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.0433360Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.0433810Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.0434118Z ) 2025-05-07T20:33:19.0434495Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.0434977Z def test_silu_mul_quant( 2025-05-07T20:33:19.0435242Z self, 2025-05-07T20:33:19.0435452Z T: int, 2025-05-07T20:33:19.0435668Z D: int, 2025-05-07T20:33:19.0435905Z scale_ub: Optional[float], 2025-05-07T20:33:19.0436198Z contiguous: bool, 2025-05-07T20:33:19.0436458Z compiled: bool, 2025-05-07T20:33:19.0436703Z ) -> None: 2025-05-07T20:33:19.0436938Z torch.manual_seed(2025) 2025-05-07T20:33:19.0437200Z 2025-05-07T20:33:19.0437494Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.0437867Z 2025-05-07T20:33:19.0438076Z x_sign = torch.sign(x) 2025-05-07T20:33:19.0438394Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.0438733Z x = x_sign * x_clamp 2025-05-07T20:33:19.0438987Z x0 = x[:, :D] 2025-05-07T20:33:19.0439225Z x1 = x[:, D:] 2025-05-07T20:33:19.0439453Z 2025-05-07T20:33:19.0439655Z if contiguous: 2025-05-07T20:33:19.0439909Z x0 = x0.contiguous() 2025-05-07T20:33:19.0440189Z x1 = x1.contiguous() 2025-05-07T20:33:19.0440455Z 2025-05-07T20:33:19.0440666Z if scale_ub is not None: 2025-05-07T20:33:19.0440963Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.0441328Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.0441661Z ) 2025-05-07T20:33:19.0441872Z else: 2025-05-07T20:33:19.0442106Z scale_ub_tensor = None 2025-05-07T20:33:19.0442374Z 2025-05-07T20:33:19.0442625Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.0442968Z op = silu_mul_quant 2025-05-07T20:33:19.0443236Z if compiled: 2025-05-07T20:33:19.0443504Z op = torch.compile(op) 2025-05-07T20:33:19.0443835Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.0444134Z 2025-05-07T20:33:19.0444345Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.0444529Z 2025-05-07T20:33:19.0444635Z moe/activation_test.py:117: 2025-05-07T20:33:19.0445037Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.0445393Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.0445696Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.0446446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.0447194Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.0447769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.0448505Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.0455516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.0456151Z kernel = self.compile( 2025-05-07T20:33:19.0456851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.0457570Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.0458003Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.0458252Z 2025-05-07T20:33:19.0458483Z self = 2025-05-07T20:33:19.0459731Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.0461217Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacb6d4940>} 2025-05-07T20:33:19.0462661Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.0463775Z context = 2025-05-07T20:33:19.0464088Z 2025-05-07T20:33:19.0464272Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.0464835Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.0465343Z module_map=module_map) 2025-05-07T20:33:19.0465743Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.0466128Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.0466417Z E ^ 2025-05-07T20:33:19.0466923Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.0467410Z 2025-05-07T20:33:19.0467863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.0468413Z 2025-05-07T20:33:19.0468530Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.0468976Z self=, 2025-05-07T20:33:19.0469412Z T=128, 2025-05-07T20:33:19.0469613Z D=7168, 2025-05-07T20:33:19.0469827Z scale_ub=None, 2025-05-07T20:33:19.0470061Z contiguous=True, 2025-05-07T20:33:19.0470303Z compiled=False, 2025-05-07T20:33:19.0470529Z ) 2025-05-07T20:33:19.0470878Z self = 2025-05-07T20:33:19.0471408Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.0471698Z 2025-05-07T20:33:19.0471784Z @given( 2025-05-07T20:33:19.0472036Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.0472378Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.0472710Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.0473069Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.0473480Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.0473929Z ) 2025-05-07T20:33:19.0474308Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.0474784Z def test_silu_mul_quant( 2025-05-07T20:33:19.0475055Z self, 2025-05-07T20:33:19.0475269Z T: int, 2025-05-07T20:33:19.0475485Z D: int, 2025-05-07T20:33:19.0475728Z scale_ub: Optional[float], 2025-05-07T20:33:19.0476019Z contiguous: bool, 2025-05-07T20:33:19.0476280Z compiled: bool, 2025-05-07T20:33:19.0476523Z ) -> None: 2025-05-07T20:33:19.0476756Z torch.manual_seed(2025) 2025-05-07T20:33:19.0477016Z 2025-05-07T20:33:19.0477310Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.0477730Z 2025-05-07T20:33:19.0477945Z x_sign = torch.sign(x) 2025-05-07T20:33:19.0478266Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.0478600Z x = x_sign * x_clamp 2025-05-07T20:33:19.0478865Z x0 = x[:, :D] 2025-05-07T20:33:19.0479101Z x1 = x[:, D:] 2025-05-07T20:33:19.0479326Z 2025-05-07T20:33:19.0479528Z if contiguous: 2025-05-07T20:33:19.0479783Z x0 = x0.contiguous() 2025-05-07T20:33:19.0480113Z x1 = x1.contiguous() 2025-05-07T20:33:19.0480422Z 2025-05-07T20:33:19.0480638Z if scale_ub is not None: 2025-05-07T20:33:19.0480940Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.0481298Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.0481639Z ) 2025-05-07T20:33:19.0481851Z else: 2025-05-07T20:33:19.0482078Z scale_ub_tensor = None 2025-05-07T20:33:19.0482351Z 2025-05-07T20:33:19.0482607Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.0482949Z op = silu_mul_quant 2025-05-07T20:33:19.0483224Z if compiled: 2025-05-07T20:33:19.0483521Z op = torch.compile(op) 2025-05-07T20:33:19.0483865Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.0484167Z 2025-05-07T20:33:19.0484376Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.0484556Z 2025-05-07T20:33:19.0484667Z moe/activation_test.py:117: 2025-05-07T20:33:19.0484990Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.0485350Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.0485657Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.0486397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.0487147Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.0487725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.0488464Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.0489199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.0489777Z kernel = self.compile( 2025-05-07T20:33:19.0490363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.0491069Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.0491498Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.0491750Z 2025-05-07T20:33:19.0491976Z self = 2025-05-07T20:33:19.0493140Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.0494679Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacb6d5240>} 2025-05-07T20:33:19.0496124Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.0497227Z context = 2025-05-07T20:33:19.0497540Z 2025-05-07T20:33:19.0497723Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.0498283Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.0498782Z module_map=module_map) 2025-05-07T20:33:19.0499175Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.0499602Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.0499879Z E ^ 2025-05-07T20:33:19.0500387Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.0500871Z 2025-05-07T20:33:19.0501324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.0502032Z 2025-05-07T20:33:19.0502192Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.0502639Z self=, 2025-05-07T20:33:19.0503073Z T=2048, 2025-05-07T20:33:19.0503282Z D=7168, 2025-05-07T20:33:19.0503523Z scale_ub=1200.0, 2025-05-07T20:33:19.0503778Z contiguous=True, 2025-05-07T20:33:19.0504021Z compiled=False, 2025-05-07T20:33:19.0504240Z ) 2025-05-07T20:33:19.1485373Z self = 2025-05-07T20:33:19.1486214Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.1486638Z 2025-05-07T20:33:19.1486736Z @given( 2025-05-07T20:33:19.1487001Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.1487336Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.1487666Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.1488027Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.1488382Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.1488696Z ) 2025-05-07T20:33:19.1489078Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.1489551Z def test_silu_mul_quant( 2025-05-07T20:33:19.1489815Z self, 2025-05-07T20:33:19.1490029Z T: int, 2025-05-07T20:33:19.1490238Z D: int, 2025-05-07T20:33:19.1490478Z scale_ub: Optional[float], 2025-05-07T20:33:19.1490771Z contiguous: bool, 2025-05-07T20:33:19.1491034Z compiled: bool, 2025-05-07T20:33:19.1491272Z ) -> None: 2025-05-07T20:33:19.1491507Z torch.manual_seed(2025) 2025-05-07T20:33:19.1491775Z 2025-05-07T20:33:19.1492063Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.1494297Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.1496333Z 2025-05-07T20:33:19.1496462Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.1496697Z 2025-05-07T20:33:19.1496808Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.1497253Z self=, 2025-05-07T20:33:19.1497839Z T=1, 2025-05-07T20:33:19.1498044Z D=5120, 2025-05-07T20:33:19.1498258Z scale_ub=1200.0, 2025-05-07T20:33:19.1498496Z contiguous=True, 2025-05-07T20:33:19.1498738Z compiled=False, 2025-05-07T20:33:19.1498963Z ) 2025-05-07T20:33:19.1499307Z self = 2025-05-07T20:33:19.1499820Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.1500108Z 2025-05-07T20:33:19.1500195Z @given( 2025-05-07T20:33:19.1500442Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.1500773Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.1501098Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.1501454Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.1501869Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.1502177Z ) 2025-05-07T20:33:19.1502555Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.1503029Z def test_silu_mul_quant( 2025-05-07T20:33:19.1503290Z self, 2025-05-07T20:33:19.1503503Z T: int, 2025-05-07T20:33:19.1503789Z D: int, 2025-05-07T20:33:19.1504021Z scale_ub: Optional[float], 2025-05-07T20:33:19.1504369Z contiguous: bool, 2025-05-07T20:33:19.1504632Z compiled: bool, 2025-05-07T20:33:19.1504869Z ) -> None: 2025-05-07T20:33:19.1505103Z torch.manual_seed(2025) 2025-05-07T20:33:19.1505363Z 2025-05-07T20:33:19.1505651Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.1506019Z 2025-05-07T20:33:19.1506228Z x_sign = torch.sign(x) 2025-05-07T20:33:19.1506534Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.1506868Z x = x_sign * x_clamp 2025-05-07T20:33:19.1507123Z x0 = x[:, :D] 2025-05-07T20:33:19.1507351Z x1 = x[:, D:] 2025-05-07T20:33:19.1507580Z 2025-05-07T20:33:19.1507786Z if contiguous: 2025-05-07T20:33:19.1508030Z x0 = x0.contiguous() 2025-05-07T20:33:19.1508306Z x1 = x1.contiguous() 2025-05-07T20:33:19.1508565Z 2025-05-07T20:33:19.1508778Z if scale_ub is not None: 2025-05-07T20:33:19.1509076Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.1509432Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.1509759Z ) 2025-05-07T20:33:19.1509963Z else: 2025-05-07T20:33:19.1510183Z scale_ub_tensor = None 2025-05-07T20:33:19.1510451Z 2025-05-07T20:33:19.1510699Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.1511033Z op = silu_mul_quant 2025-05-07T20:33:19.1511305Z if compiled: 2025-05-07T20:33:19.1511586Z op = torch.compile(op) 2025-05-07T20:33:19.1511905Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.1512200Z 2025-05-07T20:33:19.1512404Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.1512584Z 2025-05-07T20:33:19.1512690Z moe/activation_test.py:117: 2025-05-07T20:33:19.1513007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.1513365Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.1513813Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.1514547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.1515280Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.1515844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.1516568Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.1517274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.1517884Z kernel = self.compile( 2025-05-07T20:33:19.1518464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.1519161Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.1519582Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.1519824Z 2025-05-07T20:33:19.1520045Z self = 2025-05-07T20:33:19.1521188Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.1522684Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacb6d6200>} 2025-05-07T20:33:19.1524347Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.1525502Z context = 2025-05-07T20:33:19.1525904Z 2025-05-07T20:33:19.1526082Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.1526631Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.1527129Z module_map=module_map) 2025-05-07T20:33:19.1527511Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.1527884Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.1528160Z E ^ 2025-05-07T20:33:19.1528646Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.1529126Z 2025-05-07T20:33:19.1529566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.1530112Z 2025-05-07T20:33:19.1530220Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.1530664Z self=, 2025-05-07T20:33:19.1531082Z T=2048, 2025-05-07T20:33:19.1531281Z D=5120, 2025-05-07T20:33:19.1531486Z scale_ub=None, 2025-05-07T20:33:19.1531708Z contiguous=True, 2025-05-07T20:33:19.1531945Z compiled=False, 2025-05-07T20:33:19.1532163Z ) 2025-05-07T20:33:19.1532496Z self = 2025-05-07T20:33:19.1533018Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.1533309Z 2025-05-07T20:33:19.1533391Z @given( 2025-05-07T20:33:19.1533634Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.1533963Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.1534291Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.1534642Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.1534985Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.1535297Z ) 2025-05-07T20:33:19.1535677Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.1536144Z def test_silu_mul_quant( 2025-05-07T20:33:19.1536394Z self, 2025-05-07T20:33:19.1536603Z T: int, 2025-05-07T20:33:19.1536809Z D: int, 2025-05-07T20:33:19.1537035Z scale_ub: Optional[float], 2025-05-07T20:33:19.1537320Z contiguous: bool, 2025-05-07T20:33:19.1537576Z compiled: bool, 2025-05-07T20:33:19.1537807Z ) -> None: 2025-05-07T20:33:19.1538039Z torch.manual_seed(2025) 2025-05-07T20:33:19.1538294Z 2025-05-07T20:33:19.1538576Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.1539008Z 2025-05-07T20:33:19.1539214Z > x_sign = torch.sign(x) 2025-05-07T20:33:19.1541266Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.1543217Z 2025-05-07T20:33:19.1543342Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:19.1543570Z 2025-05-07T20:33:19.1543743Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.1544175Z self=, 2025-05-07T20:33:19.1544596Z T=16384, 2025-05-07T20:33:19.1544800Z D=5120, 2025-05-07T20:33:19.1545003Z scale_ub=None, 2025-05-07T20:33:19.1545228Z contiguous=True, 2025-05-07T20:33:19.1545457Z compiled=False, 2025-05-07T20:33:19.1545676Z ) 2025-05-07T20:33:19.2535362Z self = 2025-05-07T20:33:19.2536965Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.2537762Z 2025-05-07T20:33:19.2537934Z @given( 2025-05-07T20:33:19.2538403Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2539049Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2539681Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2540363Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2541051Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2541639Z ) 2025-05-07T20:33:19.2542366Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2543264Z def test_silu_mul_quant( 2025-05-07T20:33:19.2543563Z self, 2025-05-07T20:33:19.2543791Z T: int, 2025-05-07T20:33:19.2543996Z D: int, 2025-05-07T20:33:19.2544229Z scale_ub: Optional[float], 2025-05-07T20:33:19.2544515Z contiguous: bool, 2025-05-07T20:33:19.2544762Z compiled: bool, 2025-05-07T20:33:19.2544999Z ) -> None: 2025-05-07T20:33:19.2545224Z torch.manual_seed(2025) 2025-05-07T20:33:19.2545471Z 2025-05-07T20:33:19.2545756Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2547877Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2549821Z 2025-05-07T20:33:19.2549944Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.2550168Z 2025-05-07T20:33:19.2550284Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2550710Z self=, 2025-05-07T20:33:19.2551124Z T=4096, 2025-05-07T20:33:19.2551323Z D=5120, 2025-05-07T20:33:19.2551519Z scale_ub=None, 2025-05-07T20:33:19.2551738Z contiguous=True, 2025-05-07T20:33:19.2551972Z compiled=False, 2025-05-07T20:33:19.2552181Z ) 2025-05-07T20:33:19.2552518Z self = 2025-05-07T20:33:19.2553034Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.2553314Z 2025-05-07T20:33:19.2553628Z @given( 2025-05-07T20:33:19.2553869Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2554194Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2554514Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2554858Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2555199Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2555497Z ) 2025-05-07T20:33:19.2555860Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2556316Z def test_silu_mul_quant( 2025-05-07T20:33:19.2556569Z self, 2025-05-07T20:33:19.2556769Z T: int, 2025-05-07T20:33:19.2556974Z D: int, 2025-05-07T20:33:19.2557198Z scale_ub: Optional[float], 2025-05-07T20:33:19.2557549Z contiguous: bool, 2025-05-07T20:33:19.2557795Z compiled: bool, 2025-05-07T20:33:19.2558027Z ) -> None: 2025-05-07T20:33:19.2558253Z torch.manual_seed(2025) 2025-05-07T20:33:19.2558499Z 2025-05-07T20:33:19.2558779Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2560947Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2562932Z 2025-05-07T20:33:19.2563061Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.2563291Z 2025-05-07T20:33:19.2563414Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2563873Z self=, 2025-05-07T20:33:19.2564292Z T=2048, 2025-05-07T20:33:19.2564486Z D=5120, 2025-05-07T20:33:19.2564683Z scale_ub=None, 2025-05-07T20:33:19.2564907Z contiguous=False, 2025-05-07T20:33:19.2565146Z compiled=False, 2025-05-07T20:33:19.2565354Z ) 2025-05-07T20:33:19.2565685Z self = 2025-05-07T20:33:19.2566201Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.2566486Z 2025-05-07T20:33:19.2566566Z @given( 2025-05-07T20:33:19.2566804Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2567130Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2567446Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2567794Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2568140Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2568440Z ) 2025-05-07T20:33:19.2568801Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2569265Z def test_silu_mul_quant( 2025-05-07T20:33:19.2569518Z self, 2025-05-07T20:33:19.2569716Z T: int, 2025-05-07T20:33:19.2569926Z D: int, 2025-05-07T20:33:19.2570156Z scale_ub: Optional[float], 2025-05-07T20:33:19.2570433Z contiguous: bool, 2025-05-07T20:33:19.2570681Z compiled: bool, 2025-05-07T20:33:19.2570911Z ) -> None: 2025-05-07T20:33:19.2571134Z torch.manual_seed(2025) 2025-05-07T20:33:19.2571388Z 2025-05-07T20:33:19.2571667Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2573931Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2575880Z 2025-05-07T20:33:19.2576010Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.2576232Z 2025-05-07T20:33:19.2576341Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2576773Z self=, 2025-05-07T20:33:19.2577190Z T=4096, 2025-05-07T20:33:19.2577382Z D=7168, 2025-05-07T20:33:19.2577580Z scale_ub=None, 2025-05-07T20:33:19.2577806Z contiguous=True, 2025-05-07T20:33:19.2578035Z compiled=True, 2025-05-07T20:33:19.2578295Z ) 2025-05-07T20:33:19.2578626Z self = 2025-05-07T20:33:19.2579134Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:19.2579418Z 2025-05-07T20:33:19.2579498Z @given( 2025-05-07T20:33:19.2579736Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2580063Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2580458Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2580804Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2581151Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2581446Z ) 2025-05-07T20:33:19.2581811Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2582273Z def test_silu_mul_quant( 2025-05-07T20:33:19.2582522Z self, 2025-05-07T20:33:19.2582726Z T: int, 2025-05-07T20:33:19.2582935Z D: int, 2025-05-07T20:33:19.2583159Z scale_ub: Optional[float], 2025-05-07T20:33:19.2583447Z contiguous: bool, 2025-05-07T20:33:19.2583724Z compiled: bool, 2025-05-07T20:33:19.2583956Z ) -> None: 2025-05-07T20:33:19.2584178Z torch.manual_seed(2025) 2025-05-07T20:33:19.2584432Z 2025-05-07T20:33:19.2584715Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2586842Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2588777Z 2025-05-07T20:33:19.2588902Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.2589122Z 2025-05-07T20:33:19.2589228Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2589658Z self=, 2025-05-07T20:33:19.2590075Z T=2048, 2025-05-07T20:33:19.2590265Z D=5120, 2025-05-07T20:33:19.2590459Z scale_ub=1200.0, 2025-05-07T20:33:19.2590693Z contiguous=False, 2025-05-07T20:33:19.2590929Z compiled=False, 2025-05-07T20:33:19.2591138Z ) 2025-05-07T20:33:19.2598571Z self = 2025-05-07T20:33:19.2599109Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.2599402Z 2025-05-07T20:33:19.2599484Z @given( 2025-05-07T20:33:19.2599729Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2600056Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2600380Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2600725Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2601141Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2601439Z ) 2025-05-07T20:33:19.2601806Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2602270Z def test_silu_mul_quant( 2025-05-07T20:33:19.2602523Z self, 2025-05-07T20:33:19.2602726Z T: int, 2025-05-07T20:33:19.2602934Z D: int, 2025-05-07T20:33:19.2603165Z scale_ub: Optional[float], 2025-05-07T20:33:19.2603492Z contiguous: bool, 2025-05-07T20:33:19.2603740Z compiled: bool, 2025-05-07T20:33:19.2603974Z ) -> None: 2025-05-07T20:33:19.2604194Z torch.manual_seed(2025) 2025-05-07T20:33:19.2604446Z 2025-05-07T20:33:19.2604730Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2606933Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2608959Z 2025-05-07T20:33:19.2609085Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.2609309Z 2025-05-07T20:33:19.2609416Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2609846Z self=, 2025-05-07T20:33:19.2610261Z T=4096, 2025-05-07T20:33:19.2610449Z D=7168, 2025-05-07T20:33:19.2610656Z scale_ub=1200.0, 2025-05-07T20:33:19.2610890Z contiguous=True, 2025-05-07T20:33:19.2611117Z compiled=False, 2025-05-07T20:33:19.2611328Z ) 2025-05-07T20:33:19.3889445Z self = 2025-05-07T20:33:19.3890248Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.3890647Z 2025-05-07T20:33:19.3890765Z @given( 2025-05-07T20:33:19.3891079Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3891416Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3891740Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3892090Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3892436Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3892741Z ) 2025-05-07T20:33:19.3893115Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3893629Z def test_silu_mul_quant( 2025-05-07T20:33:19.3893889Z self, 2025-05-07T20:33:19.3894097Z T: int, 2025-05-07T20:33:19.3894304Z D: int, 2025-05-07T20:33:19.3894537Z scale_ub: Optional[float], 2025-05-07T20:33:19.3894829Z contiguous: bool, 2025-05-07T20:33:19.3895081Z compiled: bool, 2025-05-07T20:33:19.3895322Z ) -> None: 2025-05-07T20:33:19.3895552Z torch.manual_seed(2025) 2025-05-07T20:33:19.3895809Z 2025-05-07T20:33:19.3896095Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3898243Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3900186Z 2025-05-07T20:33:19.3900315Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.3900652Z 2025-05-07T20:33:19.3900770Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3901199Z self=, 2025-05-07T20:33:19.3901622Z T=16384, 2025-05-07T20:33:19.3901832Z D=7168, 2025-05-07T20:33:19.3902048Z scale_ub=None, 2025-05-07T20:33:19.3902277Z contiguous=False, 2025-05-07T20:33:19.3902519Z compiled=True, 2025-05-07T20:33:19.3902736Z ) 2025-05-07T20:33:19.3903072Z self = 2025-05-07T20:33:19.3903588Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:19.3903879Z 2025-05-07T20:33:19.3903968Z @given( 2025-05-07T20:33:19.3904207Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3904610Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3904934Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3905282Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3905631Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3905936Z ) 2025-05-07T20:33:19.3906302Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3906830Z def test_silu_mul_quant( 2025-05-07T20:33:19.3907172Z self, 2025-05-07T20:33:19.3907382Z T: int, 2025-05-07T20:33:19.3907592Z D: int, 2025-05-07T20:33:19.3907824Z scale_ub: Optional[float], 2025-05-07T20:33:19.3908112Z contiguous: bool, 2025-05-07T20:33:19.3908362Z compiled: bool, 2025-05-07T20:33:19.3908600Z ) -> None: 2025-05-07T20:33:19.3908830Z torch.manual_seed(2025) 2025-05-07T20:33:19.3909084Z 2025-05-07T20:33:19.3909371Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3911503Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=<ActivationTests testMethod=test_silu_mul_quant>,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
self = <ActivationTests testMethod=test_silu_mul_quant>
T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=<ActivationTests testMethod=test_silu_mul_quant>,
    T=16384,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
[test source identical to the listing above -- elided]

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=<ActivationTests testMethod=test_silu_mul_quant>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
[test source identical to the listing above -- elided]

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. [remainder of the message identical to the previous failure]
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
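The failed request sizes line up exactly with the test's input tensor: x is [T, 2*D] in bfloat16, i.e. T * 2D * 2 bytes, so T=4096 with D=7168 gives 4096 * 14336 * 2 B = 112 MiB, and T=16384 gives 448 MiB. The error text itself points at expandable segments; a minimal sketch of opting in, assuming the variable is set before CUDA is first initialized (whether it rescues this particular run is untested):

    import os

    # Opt in to expandable segments, as the OOM message above suggests.
    # Must happen before CUDA is initialized, hence before importing torch.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # the caching allocator reads the env var on first CUDA use

    x = torch.randn([4096, 2 * 7168], device="cuda", dtype=torch.bfloat16)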
Trying example: test_silu_mul_quant(
    self=<ActivationTests testMethod=test_silu_mul_quant>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <ActivationTests testMethod=test_silu_mul_quant>
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

[test source identical to the listing above -- elided; this example allocates successfully and reaches the kernel launch]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<ActivationTests testMethod=test_silu_mul_quant>,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[test source identical to the listing above -- elided]

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
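The ValueError above is architectural, not transient: fp8e4nv is Triton's name for the e4m3 float8 variant, and on NVIDIA hardware those kernels generally require compute capability 8.9 or newer (Ada/Hopper), while the A10G in a g5.4xlarge runner is sm_86. A hedged sketch of gating such tests on capability (the helper name and threshold are our illustration, not FBGEMM's API):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (float8 e4m3) Triton kernels need sm_89+; sm_86 (A10G) only
        # offers 'fp8e4b15' and 'fp8e5', matching the ValueError in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # e.g.:
    # @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires sm_89+")
    # def test_silu_mul_quant(...): ...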
Trying example: test_silu_mul_quant(
    self=<ActivationTests testMethod=test_silu_mul_quant>,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
[test source identical to the listing above -- elided]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[Triton jit.py/compiler.py frames identical to the traceback shown earlier -- elided]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<ActivationTests testMethod=test_silu_mul_quant>,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
[test source identical to the listing above -- elided]

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
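The example above runs out of memory not on the initial allocation but at torch.clamp(torch.abs(x), ...): each of those ops materializes another [T, 2*D] temporary on a GPU that is already nearly full. A hedged sketch of the same arithmetic with in-place variants (our rewrite, safe only because nothing else aliases x at this point in the test):

    import torch

    T, D = 128, 7168
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

    x_sign = torch.sign(x)  # one extra tensor we still need
    # abs_/clamp_/mul_ mutate x in place instead of allocating new temporaries:
    x = x.abs_().clamp_(0.01, 2.0).mul_(x_sign)  # == sign(x) * clamp(abs(x), 0.01, 2.0)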
Trying example: test_silu_mul_quant(
    self=<ActivationTests testMethod=test_silu_mul_quant>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
[test source identical to the listing above -- elided]

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=<ActivationTests testMethod=test_silu_mul_quant>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
[test source identical to the listing above -- elided]

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
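Free memory shrinks steadily across examples (26.44 MiB free early in the run, 4.44 MiB by this point) even though each example's tensors go out of scope, which suggests references and cached allocator blocks surviving between Hypothesis examples. A hedged sketch of per-example cleanup, as standard PyTorch hygiene rather than anything the FBGEMM suite currently does:

    import gc
    import torch

    def release_cuda_memory() -> None:
        torch.cuda.synchronize()  # let pending kernels finish first
        gc.collect()              # drop Python references to dead tensors
        torch.cuda.empty_cache()  # return cached blocks to the driver

    # e.g. call release_cuda_memory() from the test's tearDown so one
    # example's leftovers cannot starve the next example's allocations.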
FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run
  |     self._callTestMethod(testMethod)
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
  |     method()
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
    | See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=<ActivationTests testMethod=test_silu_mul_quant>,
    |     T=2048,
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=False,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case
    +---------------- 2 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=<ActivationTests testMethod=test_silu_mul_quant>,
    |     T=128,
    |     D=7168,
    |     scale_ub=None,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
    +---------------- 3 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. [message otherwise identical to sub-exception 2]
    | See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=<ActivationTests testMethod=test_silu_mul_quant>,
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
    +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |   [Triton jit/autotuner/do_bench/compile frames identical to the tracebacks shown earlier -- elided]
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=<ActivationTests testMethod=test_silu_mul_quant>,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
    +------------------------------------
2025-05-07T20:33:19.8258464Z x1 = x1.contiguous() 2025-05-07T20:33:19.8258737Z 2025-05-07T20:33:19.8258959Z if scale_ub is not None: 2025-05-07T20:33:19.8259264Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8259641Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8259992Z ) 2025-05-07T20:33:19.8260210Z else: 2025-05-07T20:33:19.8260452Z scale_ub_tensor = None 2025-05-07T20:33:19.8260738Z 2025-05-07T20:33:19.8260996Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8261353Z op = silu_mul_quant 2025-05-07T20:33:19.8261638Z if compiled: 2025-05-07T20:33:19.8261912Z op = torch.compile(op) 2025-05-07T20:33:19.8262311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8262626Z 2025-05-07T20:33:19.8262842Z y_fp8, y_scale = fn() 2025-05-07T20:33:19.8263169Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:19.8263506Z 2025-05-07T20:33:19.8263772Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8264150Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:19.8264479Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:19.8264832Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:19.8265229Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.8265581Z 2025-05-07T20:33:19.8265810Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:19.8266026Z 2025-05-07T20:33:19.8266190Z moe/activation_test.py:126: 2025-05-07T20:33:19.8266522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8266903Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:19.8267277Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.8268151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:19.8269078Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:19.8269687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8270438Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8271204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:19.8272010Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.8272846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:19.8273781Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.8274587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:19.8275304Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:19.8275975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:19.8276544Z fn() 2025-05-07T20:33:19.8277106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:19.8277750Z self.fn.run( 2025-05-07T20:33:19.8278265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8278858Z kernel = self.compile( 2025-05-07T20:33:19.8279462Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8280185Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8280623Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8280893Z 2025-05-07T20:33:19.8281129Z self = 2025-05-07T20:33:19.8282352Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8283893Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdb7bb8af0>} 2025-05-07T20:33:19.8285431Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8286570Z context = 2025-05-07T20:33:19.8286900Z 2025-05-07T20:33:19.8287088Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8302717Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8303256Z module_map=module_map) 2025-05-07T20:33:19.8303689Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8304114Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:19.8304405Z E ^ 2025-05-07T20:33:19.8304930Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8305517Z 2025-05-07T20:33:19.8305996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8306568Z 2025-05-07T20:33:19.8306684Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8307149Z self=, 2025-05-07T20:33:19.8307643Z T=2048, 2025-05-07T20:33:19.8307854Z D=5120, 2025-05-07T20:33:19.8308108Z scale_ub=1200.0, 2025-05-07T20:33:19.8308354Z contiguous=True, 2025-05-07T20:33:19.8308595Z compiled=False, 2025-05-07T20:33:19.8308820Z ) 2025-05-07T20:33:19.8309174Z self = 2025-05-07T20:33:19.8309725Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.8310030Z 2025-05-07T20:33:19.8310114Z @given( 2025-05-07T20:33:19.8310377Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8310732Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8311072Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8311439Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8311800Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8312112Z ) 2025-05-07T20:33:19.8312497Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8312992Z def test_silu_mul_quant( 2025-05-07T20:33:19.8313266Z self, 2025-05-07T20:33:19.8313483Z T: int, 2025-05-07T20:33:19.8313781Z D: int, 2025-05-07T20:33:19.8314024Z scale_ub: Optional[float], 2025-05-07T20:33:19.8314347Z contiguous: bool, 2025-05-07T20:33:19.8314647Z compiled: bool, 2025-05-07T20:33:19.8314923Z ) -> None: 2025-05-07T20:33:19.8315164Z torch.manual_seed(2025) 2025-05-07T20:33:19.8315437Z 2025-05-07T20:33:19.8315745Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8316126Z 2025-05-07T20:33:19.8316341Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8316667Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8317018Z x = x_sign * x_clamp 2025-05-07T20:33:19.8317280Z x0 = x[:, :D] 
2025-05-07T20:33:19.8317527Z x1 = x[:, D:] 2025-05-07T20:33:19.8317756Z 2025-05-07T20:33:19.8317961Z if contiguous: 2025-05-07T20:33:19.8318227Z x0 = x0.contiguous() 2025-05-07T20:33:19.8318519Z x1 = x1.contiguous() 2025-05-07T20:33:19.8318789Z 2025-05-07T20:33:19.8319010Z if scale_ub is not None: 2025-05-07T20:33:19.8319324Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8319699Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8320055Z ) 2025-05-07T20:33:19.8320277Z else: 2025-05-07T20:33:19.8320514Z scale_ub_tensor = None 2025-05-07T20:33:19.8320805Z 2025-05-07T20:33:19.8321074Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8321423Z op = silu_mul_quant 2025-05-07T20:33:19.8321771Z if compiled: 2025-05-07T20:33:19.8322056Z op = torch.compile(op) 2025-05-07T20:33:19.8322388Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8322701Z 2025-05-07T20:33:19.8322925Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.8323117Z 2025-05-07T20:33:19.8323238Z moe/activation_test.py:117: 2025-05-07T20:33:19.8323570Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8324331Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.8324673Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8325441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.8326212Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.8326971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8327734Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8328468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8329063Z kernel = self.compile( 2025-05-07T20:33:19.8329812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8330544Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8330993Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8331256Z 2025-05-07T20:33:19.8331488Z self = 2025-05-07T20:33:19.8332696Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8334238Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdb7a95990>} 2025-05-07T20:33:19.8335730Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8336870Z context = 2025-05-07T20:33:19.8337191Z 2025-05-07T20:33:19.8337385Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8337969Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8338489Z module_map=module_map) 2025-05-07T20:33:19.8338902Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8339298Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.8339588Z E ^ 2025-05-07T20:33:19.8340115Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8340615Z 2025-05-07T20:33:19.8341088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8341660Z 2025-05-07T20:33:19.8341788Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8342248Z self=, 2025-05-07T20:33:19.8342697Z T=2048, 2025-05-07T20:33:19.8342915Z D=5120, 2025-05-07T20:33:19.8343133Z scale_ub=1200.0, 2025-05-07T20:33:19.8343390Z contiguous=True, 2025-05-07T20:33:19.8343641Z compiled=True, 2025-05-07T20:33:19.8343869Z ) 2025-05-07T20:33:19.8344234Z self = 2025-05-07T20:33:19.8344788Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.8345172Z 2025-05-07T20:33:19.8345270Z @given( 2025-05-07T20:33:19.8345534Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8345887Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8346239Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8346610Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8346986Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8347316Z ) 2025-05-07T20:33:19.8347712Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8348214Z def test_silu_mul_quant( 2025-05-07T20:33:19.8348496Z self, 2025-05-07T20:33:19.8348713Z T: int, 2025-05-07T20:33:19.8348945Z D: int, 2025-05-07T20:33:19.8349247Z scale_ub: Optional[float], 2025-05-07T20:33:19.8349554Z contiguous: bool, 2025-05-07T20:33:19.8349823Z compiled: bool, 2025-05-07T20:33:19.8350077Z ) -> None: 2025-05-07T20:33:19.8350329Z torch.manual_seed(2025) 2025-05-07T20:33:19.8350598Z 2025-05-07T20:33:19.8350906Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8351294Z 2025-05-07T20:33:19.8351567Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8351948Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8352304Z x = x_sign * x_clamp 2025-05-07T20:33:19.8352575Z x0 = x[:, :D] 2025-05-07T20:33:19.8352826Z x1 = x[:, D:] 2025-05-07T20:33:19.8353065Z 2025-05-07T20:33:19.8353272Z if contiguous: 2025-05-07T20:33:19.8353617Z x0 = x0.contiguous() 2025-05-07T20:33:19.8353914Z x1 = x1.contiguous() 2025-05-07T20:33:19.8354181Z 2025-05-07T20:33:19.8354402Z if scale_ub is not None: 2025-05-07T20:33:19.8354716Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8355090Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8355441Z ) 2025-05-07T20:33:19.8355664Z else: 2025-05-07T20:33:19.8355905Z scale_ub_tensor = None 2025-05-07T20:33:19.8356186Z 2025-05-07T20:33:19.8356448Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8356807Z op = silu_mul_quant 2025-05-07T20:33:19.8357089Z if compiled: 2025-05-07T20:33:19.8357375Z op = torch.compile(op) 2025-05-07T20:33:19.8357711Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8358017Z 2025-05-07T20:33:19.8358242Z y_fp8, y_scale = fn() 2025-05-07T20:33:19.8358568Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:19.8358893Z 2025-05-07T20:33:19.8359167Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8359555Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:19.8359884Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:19.8360248Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:19.8360655Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.8361008Z 2025-05-07T20:33:19.8361239Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:19.8361468Z 2025-05-07T20:33:19.8361588Z moe/activation_test.py:126: 2025-05-07T20:33:19.8361933Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8362309Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:19.8362685Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.8363570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:19.8364410Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:19.8365017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8365834Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8366606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:19.8367410Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.8368258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:19.8369096Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.8369909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:19.8370619Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:19.8371338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:19.8371916Z fn() 2025-05-07T20:33:19.8372487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:19.8373130Z self.fn.run( 2025-05-07T20:33:19.8373655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8374364Z kernel = self.compile( 2025-05-07T20:33:19.8375014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8375742Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8376188Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8376445Z 2025-05-07T20:33:19.8376684Z self = 2025-05-07T20:33:19.8377888Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8379414Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efdb65256c0>} 2025-05-07T20:33:19.8380912Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8382054Z context = 2025-05-07T20:33:19.8382375Z 2025-05-07T20:33:19.8382571Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8383148Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8383677Z module_map=module_map) 2025-05-07T20:33:19.8384115Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8384535Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:19.8384842Z E ^ 2025-05-07T20:33:19.8385363Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8385868Z 2025-05-07T20:33:19.8386340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8386909Z 2025-05-07T20:33:19.8387028Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8387494Z self=, 2025-05-07T20:33:19.8387952Z T=16384, 2025-05-07T20:33:19.8388171Z D=7168, 2025-05-07T20:33:19.8388394Z scale_ub=1200.0, 2025-05-07T20:33:19.8388649Z contiguous=False, 2025-05-07T20:33:19.8388899Z compiled=False, 2025-05-07T20:33:19.8389134Z ) 2025-05-07T20:33:19.8389553Z self = 2025-05-07T20:33:19.8390119Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.8390433Z 2025-05-07T20:33:19.8390523Z @given( 2025-05-07T20:33:19.8390789Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8391153Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8391498Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8391870Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8392248Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8392567Z ) 2025-05-07T20:33:19.8392968Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8393463Z def test_silu_mul_quant( 2025-05-07T20:33:19.8393845Z self, 2025-05-07T20:33:19.8394074Z T: int, 2025-05-07T20:33:19.8394301Z D: int, 2025-05-07T20:33:19.8394550Z scale_ub: Optional[float], 2025-05-07T20:33:19.8394857Z contiguous: bool, 2025-05-07T20:33:19.8395129Z compiled: bool, 2025-05-07T20:33:19.8395384Z ) -> None: 2025-05-07T20:33:19.8395624Z torch.manual_seed(2025) 2025-05-07T20:33:19.8395898Z 2025-05-07T20:33:19.8396258Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8396684Z 2025-05-07T20:33:19.8396915Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8397250Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8397597Z x = x_sign * x_clamp 2025-05-07T20:33:19.8397873Z x0 = x[:, :D] 2025-05-07T20:33:19.8398121Z x1 = x[:, D:] 2025-05-07T20:33:19.8398354Z 2025-05-07T20:33:19.8398570Z if contiguous: 2025-05-07T20:33:19.8398838Z x0 = x0.contiguous() 2025-05-07T20:33:19.8399130Z x1 = x1.contiguous() 2025-05-07T20:33:19.8399402Z 2025-05-07T20:33:19.8399620Z if scale_ub is not None: 2025-05-07T20:33:19.8399932Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8400313Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8400663Z ) 2025-05-07T20:33:19.8400889Z else: 2025-05-07T20:33:19.8401128Z scale_ub_tensor = None 2025-05-07T20:33:19.8401422Z 2025-05-07T20:33:19.8401692Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8402042Z op = silu_mul_quant 2025-05-07T20:33:19.8402327Z if compiled: 
2025-05-07T20:33:19.8402612Z op = torch.compile(op) 2025-05-07T20:33:19.8402944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8403256Z 2025-05-07T20:33:19.8403483Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.8403671Z 2025-05-07T20:33:19.8403784Z moe/activation_test.py:117: 2025-05-07T20:33:19.8404125Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8404504Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.8404832Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8405603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.8406379Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.8406988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8407749Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8408493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8409093Z kernel = self.compile( 2025-05-07T20:33:19.8409705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8410435Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8410940Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8411199Z 2025-05-07T20:33:19.8411442Z self = 2025-05-07T20:33:19.8412648Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8414229Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efdb65248b0>} 2025-05-07T20:33:19.8415729Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8416938Z context = 2025-05-07T20:33:19.8417264Z 2025-05-07T20:33:19.8417461Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8418050Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8418624Z module_map=module_map) 2025-05-07T20:33:19.8419077Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8419478Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.8419772Z E ^ 2025-05-07T20:33:19.8420295Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:33:19.8421964Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
    [test body identical to the first listing above]
    Here y_fp8, y_scale = fn() succeeded; the eager reference path failed instead:

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126 -> ref_fn (moe/activation_test.py:124) -> triton_quantize_fp8_row (fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid] -> triton autotuner do_bench -> compile
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
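The compiled=True examples fail in the eager reference path rather than in fn(), and that path bottoms out in triton_quantize_fp8_row, so the failure reproduces without Hypothesis or the MoE op at all. A hedged standalone reproducer sketch (assumes the same fbgemm_gpu install and an SM < 8.9 GPU, as on this runner):

    import torch
    from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

    # Any 2-D float32 CUDA tensor will do; the shape mirrors one failing example.
    y = torch.randn(128, 7168, device="cuda", dtype=torch.float32)
    # On a pre-SM-8.9 GPU this raises triton.compiler.errors.CompilationError
    # wrapping the ValueError quoted in the log above.
    y_fp8, y_scale = triton_quantize_fp8_row(y, None)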
2025-05-07T20:33:19.8467306Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
    [test body identical to the first listing above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117 -> fn (moe/activation_test.py:115) -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid] -> triton compile
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:19.8508082Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
    [test body identical to the first listing above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117 -> fn (moe/activation_test.py:115) -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid] -> triton compile
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:19.8542999Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
    [test body identical to the first listing above; fn() succeeded]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126 -> ref_fn (moe/activation_test.py:124) -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid] -> triton autotuner do_bench -> compile
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
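For orientation, the operation under test quantizes row-wise to fp8: each row gets one scale derived from its max magnitude (optionally clamped by scale_ub), and the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A plain-PyTorch approximation of that math, offered as a hedged sketch only; the fbgemm_gpu kernel's exact semantics (epsilon handling, how the upper bound is applied) may differ:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)  # avoid div-by-zero rows
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / fp8_max  # one scale per row
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale  # dequantize: y_fp8.to(torch.float32) * scale[:, None]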
2025-05-07T20:33:19.8587500Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
    [test body identical to the first listing above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117 -> fn (moe/activation_test.py:115) -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid] -> triton compile
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:19.8621714Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    [test body identical to the first listing above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117 -> fn (moe/activation_test.py:115) -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid] -> triton compile
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:19.8675089Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
    [test body identical to the first listing above; fn() succeeded]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126 -> ref_fn (moe/activation_test.py:124) -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid] -> triton autotuner do_bench -> compile
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
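The error message itself says Triton can still emit 'fp8e4b15' and 'fp8e5' on this architecture, so an e5m2 fallback is at least representable here. A hedged design sketch of a capability-based dtype choice in Triton terms; this is not fbgemm_gpu's current behavior, and e5m2's smaller mantissa would force looser test tolerances:

    import torch
    import triton.language as tl

    def pick_triton_fp8_dtype() -> tl.dtype:
        # fp8e4nv (e4m3) is only lowered on SM 8.9 (Ada) / SM 9.0 (Hopper) and newer.
        if torch.cuda.get_device_capability() >= (8, 9):
            return tl.float8e4nv
        # e5m2 is in this architecture's supported list quoted in the error above.
        return tl.float8e5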
2025-05-07T20:33:19.8719381Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
    [test body identical to the first listing above; fn() succeeded]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126 -> ref_fn (moe/activation_test.py:124) -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid] -> triton autotuner do_bench -> compile
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
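Because Hypothesis is just re-drawing parameters around one deterministic failure, any single parameter set from this log reproduces it. A hedged sketch of pinning one with hypothesis.example so it always runs first (the pinned values are the T=2048 example above; the test body is elided):

    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @example(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
    @settings(deadline=None)
    def test_silu_mul_quant_pinned(T, D, scale_ub, contiguous, compiled) -> None:
        ...  # same body as test_silu_mul_quant in the listing above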
2025-05-07T20:33:19.8738893Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
    [test body identical to the first listing above; fn() succeeded]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126 -> ref_fn (moe/activation_test.py:124) -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid] -> triton autotuner do_bench -> compile
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8754852Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd91070700>} 2025-05-07T20:33:19.8755684Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8755906Z context = 2025-05-07T20:33:19.8755912Z 2025-05-07T20:33:19.8756104Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8756402Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8756525Z module_map=module_map) 2025-05-07T20:33:19.8756721Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8756845Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:19.8756934Z E ^ 2025-05-07T20:33:19.8757337Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8757342Z 2025-05-07T20:33:19.8757802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8757807Z 2025-05-07T20:33:19.8757934Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8758187Z self=, 2025-05-07T20:33:19.8758276Z T=4096, 2025-05-07T20:33:19.8758370Z D=5120, 2025-05-07T20:33:19.8758515Z scale_ub=None, 2025-05-07T20:33:19.8758620Z contiguous=True, 2025-05-07T20:33:19.8758718Z compiled=True, 2025-05-07T20:33:19.8758803Z ) 2025-05-07T20:33:19.8759058Z self = 2025-05-07T20:33:19.8759256Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:19.8759261Z 2025-05-07T20:33:19.8759348Z @given( 2025-05-07T20:33:19.8759491Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8759606Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8759740Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8759885Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8760017Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8760185Z ) 2025-05-07T20:33:19.8760463Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8760575Z def test_silu_mul_quant( 2025-05-07T20:33:19.8760669Z self, 2025-05-07T20:33:19.8760762Z T: int, 2025-05-07T20:33:19.8760855Z D: int, 2025-05-07T20:33:19.8760972Z scale_ub: Optional[float], 2025-05-07T20:33:19.8761124Z contiguous: bool, 2025-05-07T20:33:19.8761224Z compiled: bool, 2025-05-07T20:33:19.8761363Z ) -> None: 2025-05-07T20:33:19.8761472Z torch.manual_seed(2025) 2025-05-07T20:33:19.8761559Z 2025-05-07T20:33:19.8761757Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8761842Z 2025-05-07T20:33:19.8761949Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8762099Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8762201Z x = x_sign * x_clamp 2025-05-07T20:33:19.8762304Z x0 = x[:, :D] 2025-05-07T20:33:19.8762395Z x1 = x[:, D:] 2025-05-07T20:33:19.8762479Z 2025-05-07T20:33:19.8762582Z if contiguous: 2025-05-07T20:33:19.8762689Z x0 = x0.contiguous() 2025-05-07T20:33:19.8762792Z x1 = x1.contiguous() 2025-05-07T20:33:19.8762882Z 2025-05-07T20:33:19.8762988Z if scale_ub is not None: 2025-05-07T20:33:19.8763109Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8763275Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8763361Z ) 2025-05-07T20:33:19.8763448Z else: 2025-05-07T20:33:19.8763564Z scale_ub_tensor 
= None 2025-05-07T20:33:19.8763647Z 2025-05-07T20:33:19.8763796Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8763917Z op = silu_mul_quant 2025-05-07T20:33:19.8764022Z if compiled: 2025-05-07T20:33:19.8764136Z op = torch.compile(op) 2025-05-07T20:33:19.8764260Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8764350Z 2025-05-07T20:33:19.8764456Z y_fp8, y_scale = fn() 2025-05-07T20:33:19.8764595Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:19.8764687Z 2025-05-07T20:33:19.8764841Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8764956Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:19.8765081Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:19.8765225Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:19.8765390Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.8765474Z 2025-05-07T20:33:19.8765589Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:19.8765594Z 2025-05-07T20:33:19.8765713Z moe/activation_test.py:126: 2025-05-07T20:33:19.8765858Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8765979Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:19.8766142Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.8766816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:19.8766941Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:19.8767345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8767602Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8768019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:19.8768308Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.8768752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:19.8769091Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.8769512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:19.8769710Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:19.8770094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:19.8770269Z fn() 2025-05-07T20:33:19.8770726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:19.8770821Z self.fn.run( 2025-05-07T20:33:19.8771208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8771317Z kernel = self.compile( 2025-05-07T20:33:19.8771742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8771950Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8772097Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8772102Z 2025-05-07T20:33:19.8772333Z self = 2025-05-07T20:33:19.8773210Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8773779Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd90fa09d0>} 2025-05-07T20:33:19.8774614Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8774835Z context = 2025-05-07T20:33:19.8774842Z 2025-05-07T20:33:19.8775035Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8775330Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8775457Z module_map=module_map) 2025-05-07T20:33:19.8775651Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8775768Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:19.8775857Z E ^ 2025-05-07T20:33:19.8776263Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8776269Z 2025-05-07T20:33:19.8776729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8776736Z 2025-05-07T20:33:19.8776862Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8777157Z self=, 2025-05-07T20:33:19.8777249Z T=16384, 2025-05-07T20:33:19.8777342Z D=5120, 2025-05-07T20:33:19.8777437Z scale_ub=None, 2025-05-07T20:33:19.8777534Z contiguous=True, 2025-05-07T20:33:19.8777636Z compiled=True, 2025-05-07T20:33:19.8777724Z ) 2025-05-07T20:33:19.8777971Z self = 2025-05-07T20:33:19.8778176Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:19.8778181Z 2025-05-07T20:33:19.8778269Z @given( 2025-05-07T20:33:19.8778416Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8778529Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8778660Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8778849Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8778980Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8779066Z ) 2025-05-07T20:33:19.8779355Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8779464Z def test_silu_mul_quant( 2025-05-07T20:33:19.8779558Z self, 2025-05-07T20:33:19.8779646Z T: int, 2025-05-07T20:33:19.8779780Z D: int, 2025-05-07T20:33:19.8779938Z scale_ub: Optional[float], 2025-05-07T20:33:19.8780042Z contiguous: bool, 2025-05-07T20:33:19.8780141Z compiled: bool, 2025-05-07T20:33:19.8780237Z ) -> None: 2025-05-07T20:33:19.8780344Z torch.manual_seed(2025) 2025-05-07T20:33:19.8780429Z 2025-05-07T20:33:19.8780626Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8780711Z 2025-05-07T20:33:19.8780818Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8780966Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8781074Z x = x_sign * x_clamp 2025-05-07T20:33:19.8781173Z x0 = x[:, :D] 2025-05-07T20:33:19.8781268Z x1 = x[:, D:] 2025-05-07T20:33:19.8781354Z 2025-05-07T20:33:19.8781458Z if contiguous: 2025-05-07T20:33:19.8781563Z x0 = x0.contiguous() 2025-05-07T20:33:19.8781665Z x1 = x1.contiguous() 2025-05-07T20:33:19.8781759Z 2025-05-07T20:33:19.8781863Z if scale_ub is not None: 2025-05-07T20:33:19.8781988Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8782152Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:33:19.8782240Z ) 2025-05-07T20:33:19.8782328Z else: 2025-05-07T20:33:19.8782444Z scale_ub_tensor = None 2025-05-07T20:33:19.8782529Z 2025-05-07T20:33:19.8782675Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8782786Z op = silu_mul_quant 2025-05-07T20:33:19.8782886Z if compiled: 2025-05-07T20:33:19.8783009Z op = torch.compile(op) 2025-05-07T20:33:19.8783131Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8783214Z 2025-05-07T20:33:19.8783329Z y_fp8, y_scale = fn() 2025-05-07T20:33:19.8783467Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:19.8783552Z 2025-05-07T20:33:19.8783713Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8783834Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:19.8783950Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:19.8784102Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:19.8784284Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.8784396Z 2025-05-07T20:33:19.8784512Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:19.8784517Z 2025-05-07T20:33:19.8784629Z moe/activation_test.py:126: 2025-05-07T20:33:19.8784782Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8784905Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:19.8785126Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.8785757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:19.8785877Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:19.8786289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8786542Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8786954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:19.8787249Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.8787740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:19.8788028Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.8788457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:19.8788647Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:19.8789148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:19.8789238Z fn() 2025-05-07T20:33:19.8789685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:19.8789790Z self.fn.run( 2025-05-07T20:33:19.8790170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8790283Z kernel = self.compile( 2025-05-07T20:33:19.8790715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8790917Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8791068Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:19.8791073Z 2025-05-07T20:33:19.8791304Z self = 2025-05-07T20:33:19.8792177Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8792751Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd90fa2170>} 2025-05-07T20:33:19.8793684Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8793916Z context = 2025-05-07T20:33:19.8793921Z 2025-05-07T20:33:19.8794109Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8794416Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8794540Z module_map=module_map) 2025-05-07T20:33:19.8794723Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8794846Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:19.8794934Z E ^ 2025-05-07T20:33:19.8795333Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8795339Z 2025-05-07T20:33:19.8795810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8795815Z 2025-05-07T20:33:19.8796023Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8796281Z self=, 2025-05-07T20:33:19.8796369Z T=1, 2025-05-07T20:33:19.8796459Z D=5120, 2025-05-07T20:33:19.8796564Z scale_ub=1200.0, 2025-05-07T20:33:19.8796661Z contiguous=True, 2025-05-07T20:33:19.8796760Z compiled=True, 2025-05-07T20:33:19.8796851Z ) 2025-05-07T20:33:19.8797096Z self = 2025-05-07T20:33:19.8797290Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.8797295Z 2025-05-07T20:33:19.8797383Z @given( 2025-05-07T20:33:19.8797520Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8797638Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8797818Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8797951Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8798089Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8798175Z ) 2025-05-07T20:33:19.8798451Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8798565Z def test_silu_mul_quant( 2025-05-07T20:33:19.8798700Z self, 2025-05-07T20:33:19.8798834Z T: int, 2025-05-07T20:33:19.8798923Z D: int, 2025-05-07T20:33:19.8799036Z scale_ub: Optional[float], 2025-05-07T20:33:19.8799143Z contiguous: bool, 2025-05-07T20:33:19.8799242Z compiled: bool, 2025-05-07T20:33:19.8799332Z ) -> None: 2025-05-07T20:33:19.8799446Z torch.manual_seed(2025) 2025-05-07T20:33:19.8799529Z 2025-05-07T20:33:19.8799720Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8799814Z 2025-05-07T20:33:19.8799920Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8800062Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8800171Z x = x_sign * x_clamp 2025-05-07T20:33:19.8800266Z x0 = x[:, :D] 2025-05-07T20:33:19.8800358Z x1 = x[:, D:] 2025-05-07T20:33:19.8800448Z 2025-05-07T20:33:19.8800551Z if contiguous: 2025-05-07T20:33:19.8800662Z x0 = x0.contiguous() 2025-05-07T20:33:19.8800770Z x1 = x1.contiguous() 2025-05-07T20:33:19.8800861Z 2025-05-07T20:33:19.8800972Z if scale_ub is not None: 2025-05-07T20:33:19.8801094Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:33:19.8801247Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8801343Z ) 2025-05-07T20:33:19.8801432Z else: 2025-05-07T20:33:19.8801541Z scale_ub_tensor = None 2025-05-07T20:33:19.8801633Z 2025-05-07T20:33:19.8801781Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8801889Z op = silu_mul_quant 2025-05-07T20:33:19.8801993Z if compiled: 2025-05-07T20:33:19.8802109Z op = torch.compile(op) 2025-05-07T20:33:19.8802239Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8802326Z 2025-05-07T20:33:19.8802430Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.8802434Z 2025-05-07T20:33:19.8802553Z moe/activation_test.py:117: 2025-05-07T20:33:19.8802702Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8802822Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.8802946Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8803359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.8803467Z return fn(*args, **kwargs) 2025-05-07T20:33:19.8804029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.8804144Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.8804605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8804858Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8805241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8805361Z kernel = self.compile( 2025-05-07T20:33:19.8805790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8805997Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8806142Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8806147Z 2025-05-07T20:33:19.8806380Z self = 2025-05-07T20:33:19.8807299Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8807867Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd90f9a050>} 2025-05-07T20:33:19.8808793Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8809012Z context = 2025-05-07T20:33:19.8809018Z 2025-05-07T20:33:19.8809204Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8809508Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8809639Z module_map=module_map) 2025-05-07T20:33:19.8809835Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8809951Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.8810044Z E ^ 2025-05-07T20:33:19.8810448Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8810459Z 2025-05-07T20:33:19.8810921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8810926Z 2025-05-07T20:33:19.8811051Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8811301Z self=, 2025-05-07T20:33:19.8811389Z T=1, 2025-05-07T20:33:19.8811486Z D=5120, 2025-05-07T20:33:19.8811582Z scale_ub=None, 2025-05-07T20:33:19.8811684Z contiguous=False, 2025-05-07T20:33:19.8811787Z compiled=True, 2025-05-07T20:33:19.8811873Z ) 2025-05-07T20:33:19.8812119Z self = 2025-05-07T20:33:19.8812312Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:19.8812317Z 2025-05-07T20:33:19.8812406Z @given( 2025-05-07T20:33:19.8812547Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8812667Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8812800Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8812940Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8813071Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8813155Z ) 2025-05-07T20:33:19.8813440Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8813548Z def test_silu_mul_quant( 2025-05-07T20:33:19.8813636Z self, 2025-05-07T20:33:19.8813735Z T: int, 2025-05-07T20:33:19.8813831Z D: int, 2025-05-07T20:33:19.8813948Z scale_ub: Optional[float], 2025-05-07T20:33:19.8814112Z contiguous: bool, 2025-05-07T20:33:19.8814213Z compiled: bool, 2025-05-07T20:33:19.8814307Z ) -> None: 2025-05-07T20:33:19.8814417Z torch.manual_seed(2025) 2025-05-07T20:33:19.8814501Z 2025-05-07T20:33:19.8814705Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8814794Z 2025-05-07T20:33:19.8814899Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8815048Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8815151Z x = x_sign * x_clamp 2025-05-07T20:33:19.8815242Z x0 = x[:, :D] 2025-05-07T20:33:19.8815344Z x1 = x[:, D:] 2025-05-07T20:33:19.8815428Z 2025-05-07T20:33:19.8815524Z if contiguous: 2025-05-07T20:33:19.8815636Z x0 = x0.contiguous() 2025-05-07T20:33:19.8815786Z x1 = x1.contiguous() 2025-05-07T20:33:19.8815878Z 2025-05-07T20:33:19.8815982Z if scale_ub is not None: 2025-05-07T20:33:19.8816103Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8816268Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8816356Z ) 2025-05-07T20:33:19.8816445Z else: 2025-05-07T20:33:19.8816559Z scale_ub_tensor = None 2025-05-07T20:33:19.8816687Z 2025-05-07T20:33:19.8816876Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8816988Z op = silu_mul_quant 2025-05-07T20:33:19.8817086Z if compiled: 2025-05-07T20:33:19.8817200Z op = torch.compile(op) 2025-05-07T20:33:19.8817327Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8817411Z 2025-05-07T20:33:19.8817523Z y_fp8, y_scale = fn() 2025-05-07T20:33:19.8817662Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:19.8817751Z 2025-05-07T20:33:19.8817912Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8818028Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:19.8818145Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:19.8818291Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:19.8818452Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.8818541Z 2025-05-07T20:33:19.8818664Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:19.8818672Z 2025-05-07T20:33:19.8818784Z moe/activation_test.py:126: 2025-05-07T20:33:19.8818937Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8819059Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:19.8819213Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.8819844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:19.8819963Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:19.8820370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8820628Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8821040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:19.8821341Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.8821788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:19.8822076Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.8822502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:19.8822695Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:19.8823143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:19.8823234Z fn() 2025-05-07T20:33:19.8823685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:19.8823797Z self.fn.run( 2025-05-07T20:33:19.8824536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8824649Z kernel = self.compile( 2025-05-07T20:33:19.8825080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8825283Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8825433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8825606Z 2025-05-07T20:33:19.8825839Z self = 2025-05-07T20:33:19.8839926Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8840829Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efd91499900>} 2025-05-07T20:33:19.8841676Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8841910Z context = 2025-05-07T20:33:19.8841916Z 2025-05-07T20:33:19.8842112Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8842421Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8842555Z module_map=module_map) 2025-05-07T20:33:19.8842744Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8842863Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:19.8842963Z E ^ 2025-05-07T20:33:19.8843368Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8843373Z 2025-05-07T20:33:19.8843847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8843852Z 2025-05-07T20:33:19.8843973Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8844225Z self=, 2025-05-07T20:33:19.8844325Z T=1, 2025-05-07T20:33:19.8844414Z D=5120, 2025-05-07T20:33:19.8844510Z scale_ub=None, 2025-05-07T20:33:19.8844616Z contiguous=True, 2025-05-07T20:33:19.8844714Z compiled=False, 2025-05-07T20:33:19.8844810Z ) 2025-05-07T20:33:19.8845057Z self = 2025-05-07T20:33:19.8845243Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.8845251Z 2025-05-07T20:33:19.8845349Z @given( 2025-05-07T20:33:19.8845489Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8845605Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8845746Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8845881Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8846011Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8846104Z ) 2025-05-07T20:33:19.8846383Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8846500Z def test_silu_mul_quant( 2025-05-07T20:33:19.8846590Z self, 2025-05-07T20:33:19.8846683Z T: int, 2025-05-07T20:33:19.8846778Z D: int, 2025-05-07T20:33:19.8846978Z scale_ub: Optional[float], 2025-05-07T20:33:19.8847084Z contiguous: bool, 2025-05-07T20:33:19.8847190Z compiled: bool, 2025-05-07T20:33:19.8847281Z ) -> None: 2025-05-07T20:33:19.8847392Z torch.manual_seed(2025) 2025-05-07T20:33:19.8847483Z 2025-05-07T20:33:19.8847679Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8847766Z 2025-05-07T20:33:19.8847878Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8848021Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8848130Z x = x_sign * x_clamp 2025-05-07T20:33:19.8848224Z x0 = x[:, :D] 2025-05-07T20:33:19.8848318Z x1 = x[:, D:] 2025-05-07T20:33:19.8848407Z 2025-05-07T20:33:19.8848554Z if contiguous: 2025-05-07T20:33:19.8848661Z x0 = x0.contiguous() 2025-05-07T20:33:19.8848768Z x1 = x1.contiguous() 2025-05-07T20:33:19.8848853Z 2025-05-07T20:33:19.8848961Z if scale_ub is not None: 2025-05-07T20:33:19.8849085Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8849237Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8849367Z ) 2025-05-07T20:33:19.8849462Z else: 2025-05-07T20:33:19.8849609Z scale_ub_tensor = None 2025-05-07T20:33:19.8849693Z 2025-05-07T20:33:19.8849847Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8849950Z op = silu_mul_quant 2025-05-07T20:33:19.8850056Z if compiled: 2025-05-07T20:33:19.8850169Z 
op = torch.compile(op) 2025-05-07T20:33:19.8850288Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8850377Z 2025-05-07T20:33:19.8850480Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.8850488Z 2025-05-07T20:33:19.8850598Z moe/activation_test.py:117: 2025-05-07T20:33:19.8850749Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8850868Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.8850981Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8851544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.8851661Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.8852070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8852323Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8852700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8852813Z kernel = self.compile( 2025-05-07T20:33:19.8853243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8853448Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8853617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8853622Z 2025-05-07T20:33:19.8853877Z self = 2025-05-07T20:33:19.8854754Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8855318Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd914988b0>} 2025-05-07T20:33:19.8856149Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8856417Z context = 2025-05-07T20:33:19.8856423Z 2025-05-07T20:33:19.8856610Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8856914Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8857041Z module_map=module_map) 2025-05-07T20:33:19.8857232Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8857344Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.8857431Z E ^ 2025-05-07T20:33:19.8857831Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8857835Z 2025-05-07T20:33:19.8858294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8858345Z 2025-05-07T20:33:19.8858472Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8858724Z self=, 2025-05-07T20:33:19.8858813Z T=128, 2025-05-07T20:33:19.8858909Z D=5120, 2025-05-07T20:33:19.8859003Z scale_ub=None, 2025-05-07T20:33:19.8859146Z contiguous=False, 2025-05-07T20:33:19.8859292Z compiled=True, 2025-05-07T20:33:19.8859376Z ) 2025-05-07T20:33:19.8859620Z self = 2025-05-07T20:33:19.8859819Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:19.8859824Z 2025-05-07T20:33:19.8859912Z @given( 2025-05-07T20:33:19.8860054Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8860169Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8860301Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8860443Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8860572Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8860659Z ) 2025-05-07T20:33:19.8860939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8861047Z def test_silu_mul_quant( 2025-05-07T20:33:19.8861140Z self, 2025-05-07T20:33:19.8861239Z T: int, 2025-05-07T20:33:19.8861329Z D: int, 2025-05-07T20:33:19.8861441Z scale_ub: Optional[float], 2025-05-07T20:33:19.8861550Z contiguous: bool, 2025-05-07T20:33:19.8861648Z compiled: bool, 2025-05-07T20:33:19.8861745Z ) -> None: 2025-05-07T20:33:19.8861853Z torch.manual_seed(2025) 2025-05-07T20:33:19.8861937Z 2025-05-07T20:33:19.8862133Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8862218Z 2025-05-07T20:33:19.8862327Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8862475Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8862576Z x = x_sign * x_clamp 2025-05-07T20:33:19.8862669Z x0 = x[:, :D] 2025-05-07T20:33:19.8862766Z x1 = x[:, D:] 2025-05-07T20:33:19.8862848Z 2025-05-07T20:33:19.8862944Z if contiguous: 2025-05-07T20:33:19.8863056Z x0 = x0.contiguous() 2025-05-07T20:33:19.8863159Z x1 = x1.contiguous() 2025-05-07T20:33:19.8863247Z 2025-05-07T20:33:19.8863357Z if scale_ub is not None: 2025-05-07T20:33:19.8863475Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8863634Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8863719Z ) 2025-05-07T20:33:19.8863806Z else: 2025-05-07T20:33:19.8863918Z scale_ub_tensor = None 2025-05-07T20:33:19.8863999Z 2025-05-07T20:33:19.8864161Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8864283Z op = silu_mul_quant 2025-05-07T20:33:19.8864401Z if compiled: 2025-05-07T20:33:19.8864514Z op = torch.compile(op) 2025-05-07T20:33:19.8864692Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8864776Z 2025-05-07T20:33:19.8864885Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.8864890Z 2025-05-07T20:33:19.8864999Z moe/activation_test.py:117: 2025-05-07T20:33:19.8865148Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8865268Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.8865380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8865791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.8865902Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.8866449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.8866612Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.8867013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8867263Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8867648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8867870Z kernel = self.compile( 2025-05-07T20:33:19.8868298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8868503Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8868647Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8868652Z 2025-05-07T20:33:19.8868886Z self = 2025-05-07T20:33:19.8869755Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8870318Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd9149b880>} 2025-05-07T20:33:19.8871156Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8871371Z context = 2025-05-07T20:33:19.8871376Z 2025-05-07T20:33:19.8871568Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8871866Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8871997Z module_map=module_map) 2025-05-07T20:33:19.8872178Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8872296Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.8872390Z E ^ 2025-05-07T20:33:19.8872787Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8872795Z 2025-05-07T20:33:19.8873255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8873261Z 2025-05-07T20:33:19.8873384Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8873731Z self=, 2025-05-07T20:33:19.8873825Z T=128, 2025-05-07T20:33:19.8873912Z D=7168, 2025-05-07T20:33:19.8874006Z scale_ub=1200.0, 2025-05-07T20:33:19.8874119Z contiguous=False, 2025-05-07T20:33:19.8874237Z compiled=False, 2025-05-07T20:33:19.8874331Z ) 2025-05-07T20:33:19.8874593Z self = 2025-05-07T20:33:19.8874844Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.8874850Z 2025-05-07T20:33:19.8874937Z @given( 2025-05-07T20:33:19.8875076Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8875195Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8875337Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8875471Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8875599Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8875691Z ) 2025-05-07T20:33:19.8875966Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8876073Z def test_silu_mul_quant( 2025-05-07T20:33:19.8876167Z self, 2025-05-07T20:33:19.8876304Z T: int, 2025-05-07T20:33:19.8876391Z D: int, 2025-05-07T20:33:19.8876508Z scale_ub: Optional[float], 2025-05-07T20:33:19.8876610Z contiguous: bool, 2025-05-07T20:33:19.8876711Z compiled: bool, 2025-05-07T20:33:19.8876808Z ) -> None: 2025-05-07T20:33:19.8876917Z torch.manual_seed(2025) 2025-05-07T20:33:19.8877006Z 2025-05-07T20:33:19.8877199Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8877330Z 2025-05-07T20:33:19.8877479Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8877621Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8877724Z x = x_sign * x_clamp 2025-05-07T20:33:19.8877824Z x0 = x[:, :D] 2025-05-07T20:33:19.8877915Z x1 = x[:, D:] 2025-05-07T20:33:19.8877999Z 2025-05-07T20:33:19.8878100Z if contiguous: 2025-05-07T20:33:19.8878204Z x0 = x0.contiguous() 2025-05-07T20:33:19.8878305Z x1 = x1.contiguous() 2025-05-07T20:33:19.8878398Z 2025-05-07T20:33:19.8878501Z if scale_ub is not None: 2025-05-07T20:33:19.8878625Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8878780Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8878866Z ) 2025-05-07T20:33:19.8878959Z else: 2025-05-07T20:33:19.8879067Z scale_ub_tensor = None 2025-05-07T20:33:19.8879155Z 2025-05-07T20:33:19.8879312Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8879414Z op = silu_mul_quant 2025-05-07T20:33:19.8879510Z if compiled: 2025-05-07T20:33:19.8879629Z op = torch.compile(op) 2025-05-07T20:33:19.8879748Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8879832Z 2025-05-07T20:33:19.8879940Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.8879945Z 2025-05-07T20:33:19.8880056Z moe/activation_test.py:117: 2025-05-07T20:33:19.8880208Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8880326Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.8880439Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8881004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.8881113Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.8881519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8881776Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8882155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8882266Z kernel = self.compile( 2025-05-07T20:33:19.8882694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8882894Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8883042Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8883098Z 2025-05-07T20:33:19.8883330Z self = 2025-05-07T20:33:19.8884200Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8884845Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd91499090>} 2025-05-07T20:33:19.8885677Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8885942Z context = 2025-05-07T20:33:19.8885947Z 2025-05-07T20:33:19.8886136Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8886442Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8886564Z module_map=module_map) 2025-05-07T20:33:19.8886831Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8886952Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.8887040Z E ^ 2025-05-07T20:33:19.8887440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8887451Z 2025-05-07T20:33:19.8887917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8887925Z 2025-05-07T20:33:19.8888043Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8888301Z self=, 2025-05-07T20:33:19.8888392Z T=128, 2025-05-07T20:33:19.8888480Z D=5120, 2025-05-07T20:33:19.8888580Z scale_ub=None, 2025-05-07T20:33:19.8888680Z contiguous=False, 2025-05-07T20:33:19.8888781Z compiled=False, 2025-05-07T20:33:19.8888865Z ) 2025-05-07T20:33:19.8889114Z self = 2025-05-07T20:33:19.8889312Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.8889317Z 2025-05-07T20:33:19.8889408Z @given( 2025-05-07T20:33:19.8889544Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8889665Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8889798Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8889931Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8890070Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8890156Z ) 2025-05-07T20:33:19.8890440Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8890548Z def test_silu_mul_quant( 2025-05-07T20:33:19.8890636Z self, 2025-05-07T20:33:19.8890731Z T: int, 2025-05-07T20:33:19.8890818Z D: int, 2025-05-07T20:33:19.8890935Z scale_ub: Optional[float], 2025-05-07T20:33:19.8891047Z contiguous: bool, 2025-05-07T20:33:19.8891146Z compiled: bool, 2025-05-07T20:33:19.8891237Z ) -> None: 2025-05-07T20:33:19.8891352Z torch.manual_seed(2025) 2025-05-07T20:33:19.8891437Z 2025-05-07T20:33:19.8891631Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8891724Z 2025-05-07T20:33:19.8891834Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8891982Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8892090Z x = x_sign * x_clamp 2025-05-07T20:33:19.8892181Z x0 = x[:, :D] 2025-05-07T20:33:19.8892280Z x1 = x[:, D:] 2025-05-07T20:33:19.8892364Z 2025-05-07T20:33:19.8892513Z if contiguous: 2025-05-07T20:33:19.8892626Z x0 = x0.contiguous() 2025-05-07T20:33:19.8892729Z x1 = x1.contiguous() 2025-05-07T20:33:19.8892813Z 2025-05-07T20:33:19.8892924Z if scale_ub is not None: 2025-05-07T20:33:19.8893047Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8893204Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8893298Z ) 2025-05-07T20:33:19.8893388Z else: 2025-05-07T20:33:19.8893499Z scale_ub_tensor = None 2025-05-07T20:33:19.8893613Z 2025-05-07T20:33:19.8893782Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8893899Z op = silu_mul_quant 2025-05-07T20:33:19.8893996Z if compiled: 2025-05-07T20:33:19.8894155Z op = torch.compile(op) 2025-05-07T20:33:19.8894282Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8894365Z 2025-05-07T20:33:19.8894473Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.8894477Z 2025-05-07T20:33:19.8894596Z moe/activation_test.py:117: 2025-05-07T20:33:19.8894740Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8894901Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.8895060Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8895622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.8895742Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.8896148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8896400Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8896794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8896903Z kernel = self.compile( 2025-05-07T20:33:19.8897339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8897542Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8897690Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8897696Z 2025-05-07T20:33:19.8897933Z self = 2025-05-07T20:33:19.8898803Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8899377Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd913a3eb0>} 2025-05-07T20:33:19.8900213Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8900431Z context = 2025-05-07T20:33:19.8900440Z 2025-05-07T20:33:19.8900637Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8900935Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8901064Z module_map=module_map) 2025-05-07T20:33:19.8901247Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8901358Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.8901453Z E ^ 2025-05-07T20:33:19.8901854Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8901859Z 2025-05-07T20:33:19.8902374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8902386Z 2025-05-07T20:33:19.8902506Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8902758Z self=, 2025-05-07T20:33:19.8902855Z T=128, 2025-05-07T20:33:19.8902945Z D=5120, 2025-05-07T20:33:19.8903041Z scale_ub=1200.0, 2025-05-07T20:33:19.8903144Z contiguous=True, 2025-05-07T20:33:19.8903239Z compiled=False, 2025-05-07T20:33:19.8903323Z ) 2025-05-07T20:33:19.8903594Z self = 2025-05-07T20:33:19.8903822Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.8903829Z 2025-05-07T20:33:19.8903970Z @given( 2025-05-07T20:33:19.8904103Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8904217Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8904357Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8904493Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8904623Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8904795Z ) 2025-05-07T20:33:19.8905126Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8905235Z def test_silu_mul_quant( 2025-05-07T20:33:19.8905326Z self, 2025-05-07T20:33:19.8905414Z T: int, 2025-05-07T20:33:19.8905501Z D: int, 2025-05-07T20:33:19.8905621Z scale_ub: Optional[float], 2025-05-07T20:33:19.8905723Z contiguous: bool, 2025-05-07T20:33:19.8905826Z compiled: bool, 2025-05-07T20:33:19.8905915Z ) -> None: 2025-05-07T20:33:19.8906022Z torch.manual_seed(2025) 2025-05-07T20:33:19.8906116Z 2025-05-07T20:33:19.8906308Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8906392Z 2025-05-07T20:33:19.8906505Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8906646Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8906747Z x = x_sign * x_clamp 2025-05-07T20:33:19.8906845Z x0 = x[:, :D] 2025-05-07T20:33:19.8906941Z x1 = x[:, D:] 2025-05-07T20:33:19.8907024Z 2025-05-07T20:33:19.8907127Z if contiguous: 2025-05-07T20:33:19.8907231Z x0 = x0.contiguous() 2025-05-07T20:33:19.8907352Z x1 = x1.contiguous() 2025-05-07T20:33:19.8907435Z 2025-05-07T20:33:19.8907537Z if scale_ub is not None: 2025-05-07T20:33:19.8907661Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8907815Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8907904Z ) 2025-05-07T20:33:19.8908004Z else: 2025-05-07T20:33:19.8908113Z scale_ub_tensor = None 2025-05-07T20:33:19.8908196Z 2025-05-07T20:33:19.8908350Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8908456Z op = silu_mul_quant 2025-05-07T20:33:19.8908554Z if compiled: 2025-05-07T20:33:19.8908675Z op = torch.compile(op) 2025-05-07T20:33:19.8908797Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8908888Z 2025-05-07T20:33:19.8908995Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.8909000Z 2025-05-07T20:33:19.8909112Z moe/activation_test.py:117: 2025-05-07T20:33:19.8909265Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8909380Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.8909494Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8910066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.8910180Z 
2025-05-07T20:33:19.8916050Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:19.8916170Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:19.8916258Z E   ^
2025-05-07T20:33:19.8916662Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:19.8917137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:19.8917142Z 
2025-05-07T20:33:19.8917260Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:19.8918518Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:33:19.8918523Z 
2025-05-07T20:33:19.8918617Z     @given(
2025-05-07T20:33:19.8918751Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:19.8918878Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:19.8919009Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:19.8919148Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:19.8919276Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:19.8919363Z     )
2025-05-07T20:33:19.8919645Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:19.8919753Z     def test_silu_mul_quant(
2025-05-07T20:33:19.8919841Z         self,
2025-05-07T20:33:19.8919939Z         T: int,
2025-05-07T20:33:19.8920027Z         D: int,
2025-05-07T20:33:19.8920140Z         scale_ub: Optional[float],
2025-05-07T20:33:19.8920249Z         contiguous: bool,
2025-05-07T20:33:19.8920396Z         compiled: bool,
2025-05-07T20:33:19.8920489Z     ) -> None:
2025-05-07T20:33:19.8920603Z         torch.manual_seed(2025)
2025-05-07T20:33:19.8920687Z 
2025-05-07T20:33:19.8920879Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:19.8920975Z 
2025-05-07T20:33:19.8921080Z         x_sign = torch.sign(x)
2025-05-07T20:33:19.8921230Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:19.8921332Z         x = x_sign * x_clamp
2025-05-07T20:33:19.8921423Z         x0 = x[:, :D]
2025-05-07T20:33:19.8921519Z         x1 = x[:, D:]
2025-05-07T20:33:19.8921602Z 
2025-05-07T20:33:19.8921697Z         if contiguous:
2025-05-07T20:33:19.8921809Z             x0 = x0.contiguous()
2025-05-07T20:33:19.8921911Z             x1 = x1.contiguous()
2025-05-07T20:33:19.8922041Z 
2025-05-07T20:33:19.8922153Z         if scale_ub is not None:
2025-05-07T20:33:19.8922274Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:19.8922429Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:19.8922521Z             )
2025-05-07T20:33:19.8922608Z         else:
2025-05-07T20:33:19.8922720Z             scale_ub_tensor = None
2025-05-07T20:33:19.8922851Z 
2025-05-07T20:33:19.8923041Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:19.8923155Z             op = silu_mul_quant
2025-05-07T20:33:19.8923251Z             if compiled:
2025-05-07T20:33:19.8923365Z                 op = torch.compile(op)
2025-05-07T20:33:19.8923492Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:19.8923576Z 
2025-05-07T20:33:19.8923679Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:19.8923684Z 
2025-05-07T20:33:19.8924093Z moe/activation_test.py:117: 
2025-05-07T20:33:19.8924344Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:19.8924517Z moe/activation_test.py:115: in fn
2025-05-07T20:33:19.8924635Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:19.8925050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:19.8925163Z     return fn(*args, **kwargs)
2025-05-07T20:33:19.8925723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:19.8925834Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:19.8926240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:19.8926489Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:19.8926874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:19.8926983Z     kernel = self.compile(
2025-05-07T20:33:19.8927411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:19.8927618Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:19.8927760Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:19.8930881Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:19.8931182Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, module_map=module_map)
2025-05-07T20:33:19.8931495Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:19.8931609Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:19.8931700Z E   ^
2025-05-07T20:33:19.8932103Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:19.8932642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
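(Note: every draw in this run fails for the same reason. Triton can lower the fp8e4nv type, i.e. FP8 E4M3, only on GPUs with compute capability 8.9 or newer (Ada/Hopper); older architectures expose just fp8e4b15 and fp8e5, which is exactly what the ValueError reports. A minimal guard is sketched below, assuming a unittest-style suite; supports_fp8e4nv and the class name are hypothetical illustrations, not FBGEMM API:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton lowers FP8 E4M3 ("fp8e4nv") only on compute capability >= 8.9;
        # earlier GPUs offer just fp8e4b15/fp8e5, per the ValueError above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "FP8 E4M3 (fp8e4nv) requires SM 8.9+")
    class ActivationTests(unittest.TestCase):  # hypothetical class name
        ...

With such a guard the suite would skip cleanly on pre-SM89 runners instead of failing every Hypothesis draw.)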
2025-05-07T20:33:19.8932775Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:19.8932775Z (identical test body; fails the same way at fn() with the fp8e4nv CompilationError from _fbgemm_silu_mul_quant)
2025-05-07T20:33:19.8947839Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:19.8949147Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
(test body identical to the listing above; this time fn() itself succeeded and the reference path failed instead)
2025-05-07T20:33:19.8954433Z         y_fp8, y_scale = fn()
2025-05-07T20:33:19.8954570Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:19.8954662Z 
2025-05-07T20:33:19.8954817Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:19.8954934Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:19.8955054Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:19.8955194Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:19.8955351Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:19.8955443Z 
2025-05-07T20:33:19.8955561Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:19.8955566Z 
2025-05-07T20:33:19.8955683Z moe/activation_test.py:126: 
2025-05-07T20:33:19.8955828Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:19.8955949Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:19.8956111Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:19.8956738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:19.8956857Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:19.8957324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:19.8957577Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:19.8957995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:19.8958287Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:19.8958736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:33:19.8959028Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:19.8959450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:19.8959691Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:19.8960081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:19.8960168Z     fn()
2025-05-07T20:33:19.8960629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:19.8960770Z     self.fn.run(
2025-05-07T20:33:19.8961188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:19.8961307Z     kernel = self.compile(
2025-05-07T20:33:19.8961734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:19.8961940Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:19.8962085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:19.8965036Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:19.8965335Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, module_map=module_map)
2025-05-07T20:33:19.8965649Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:19.8965767Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:33:19.8965860Z E   ^
2025-05-07T20:33:19.8966269Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:19.8966733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
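(Note the variant just above: with scale_ub=None the silu_mul_quant call got through, and the failure moved into the test's reference path, because triton_quantize_fp8_row is itself a Triton kernel that casts to fp8e4nv. On a device without fp8e4nv support, one option is to keep the reference in eager PyTorch. A sketch, assuming torch.float8_e4m3fn is available in the installed PyTorch and that the scale convention matches the test's dequantization y_fp8.to(torch.float32) * y_scale[:, None]; rowwise_quantize_fp8_ref is a hypothetical helper, and its clamping/eps details are guesses rather than the FBGEMM kernel's exact semantics:

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def rowwise_quantize_fp8_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute max in fp32, optionally clamped to the upper bound.
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Scale chosen so that y ~= y_fp8.to(torch.float32) * scale[:, None].
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Such a fallback keeps the comparison meaningful even where Triton cannot compile either kernel.)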
2025-05-07T20:33:19.8972982Z op = torch.compile(op) 2025-05-07T20:33:19.8973101Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8973184Z 2025-05-07T20:33:19.8973297Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.8973301Z 2025-05-07T20:33:19.8973412Z moe/activation_test.py:117: 2025-05-07T20:33:19.8973560Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8973681Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.8973795Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8974206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.8974324Z return fn(*args, **kwargs) 2025-05-07T20:33:19.8974876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.8974994Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.8975396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8975649Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8976043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8976233Z kernel = self.compile( 2025-05-07T20:33:19.8976670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8976871Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8982067Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8982079Z 2025-05-07T20:33:19.8982337Z self = 2025-05-07T20:33:19.8983208Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8983914Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd90b7feb0>} 2025-05-07T20:33:19.8984753Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8985022Z context = 2025-05-07T20:33:19.8985068Z 2025-05-07T20:33:19.8985264Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.8985561Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.8985692Z module_map=module_map) 2025-05-07T20:33:19.8985875Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.8985990Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.8986089Z E ^ 2025-05-07T20:33:19.8986490Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.8986494Z 2025-05-07T20:33:19.8986963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.8986975Z 2025-05-07T20:33:19.8987094Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.8987352Z self=, 2025-05-07T20:33:19.8987448Z T=1, 2025-05-07T20:33:19.8987537Z D=5120, 2025-05-07T20:33:19.8987632Z scale_ub=1200.0, 2025-05-07T20:33:19.8987737Z contiguous=False, 2025-05-07T20:33:19.8987833Z compiled=False, 2025-05-07T20:33:19.8987919Z ) 2025-05-07T20:33:19.8988173Z self = 2025-05-07T20:33:19.8988364Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.8988372Z 2025-05-07T20:33:19.8988467Z @given( 2025-05-07T20:33:19.8988602Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.8988719Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.8988855Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.8988989Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.8989120Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.8989215Z ) 2025-05-07T20:33:19.8989496Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.8989603Z def test_silu_mul_quant( 2025-05-07T20:33:19.8989698Z self, 2025-05-07T20:33:19.8989786Z T: int, 2025-05-07T20:33:19.8989876Z D: int, 2025-05-07T20:33:19.8989993Z scale_ub: Optional[float], 2025-05-07T20:33:19.8990097Z contiguous: bool, 2025-05-07T20:33:19.8990200Z compiled: bool, 2025-05-07T20:33:19.8990290Z ) -> None: 2025-05-07T20:33:19.8990402Z torch.manual_seed(2025) 2025-05-07T20:33:19.8990495Z 2025-05-07T20:33:19.8990688Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.8990846Z 2025-05-07T20:33:19.8990959Z x_sign = torch.sign(x) 2025-05-07T20:33:19.8991102Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.8991206Z x = x_sign * x_clamp 2025-05-07T20:33:19.8991306Z x0 = x[:, :D] 2025-05-07T20:33:19.8991399Z x1 = x[:, D:] 2025-05-07T20:33:19.8991483Z 2025-05-07T20:33:19.8991585Z if contiguous: 2025-05-07T20:33:19.8991689Z x0 = x0.contiguous() 2025-05-07T20:33:19.8991799Z x1 = x1.contiguous() 2025-05-07T20:33:19.8991882Z 2025-05-07T20:33:19.8991984Z if scale_ub is not None: 2025-05-07T20:33:19.8992112Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.8992266Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.8992403Z ) 2025-05-07T20:33:19.8992497Z else: 2025-05-07T20:33:19.8992605Z scale_ub_tensor = None 2025-05-07T20:33:19.8992688Z 2025-05-07T20:33:19.8992848Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.8992952Z op = silu_mul_quant 2025-05-07T20:33:19.8993048Z if compiled: 2025-05-07T20:33:19.8993169Z op = torch.compile(op) 2025-05-07T20:33:19.8993336Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8993467Z 2025-05-07T20:33:19.8993713Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.8993718Z 2025-05-07T20:33:19.8993830Z moe/activation_test.py:117: 2025-05-07T20:33:19.8993984Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8994100Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.8994214Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.8994831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.8994946Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.8995354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.8995613Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.8995995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.8996117Z kernel = self.compile( 2025-05-07T20:33:19.8996550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.8996748Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.8996901Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.8996906Z 2025-05-07T20:33:19.8997138Z self = 2025-05-07T20:33:19.8998017Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.8998585Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd913a3e20>} 2025-05-07T20:33:19.8999430Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.8999648Z context = 2025-05-07T20:33:19.8999653Z 2025-05-07T20:33:19.8999839Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9000143Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9000269Z module_map=module_map) 2025-05-07T20:33:19.9000506Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9000630Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9000719Z E ^ 2025-05-07T20:33:19.9001125Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9001135Z 2025-05-07T20:33:19.9001596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9001601Z 2025-05-07T20:33:19.9001720Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9001979Z self=, 2025-05-07T20:33:19.9002069Z T=16384, 2025-05-07T20:33:19.9002163Z D=5120, 2025-05-07T20:33:19.9002260Z scale_ub=1200.0, 2025-05-07T20:33:19.9002408Z contiguous=False, 2025-05-07T20:33:19.9002511Z compiled=True, 2025-05-07T20:33:19.9002598Z ) 2025-05-07T20:33:19.9002846Z self = 2025-05-07T20:33:19.9003055Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:19.9003060Z 2025-05-07T20:33:19.9003152Z @given( 2025-05-07T20:33:19.9003333Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9003496Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9003631Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9003772Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9003902Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9003988Z ) 2025-05-07T20:33:19.9004274Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9004382Z def test_silu_mul_quant( 2025-05-07T20:33:19.9004473Z self, 2025-05-07T20:33:19.9004568Z T: int, 2025-05-07T20:33:19.9004657Z D: int, 2025-05-07T20:33:19.9004771Z scale_ub: Optional[float], 2025-05-07T20:33:19.9004883Z contiguous: bool, 2025-05-07T20:33:19.9004982Z compiled: bool, 2025-05-07T20:33:19.9005072Z ) -> None: 2025-05-07T20:33:19.9005187Z torch.manual_seed(2025) 2025-05-07T20:33:19.9005274Z 2025-05-07T20:33:19.9005472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9005563Z 2025-05-07T20:33:19.9005668Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9005821Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9005924Z x = x_sign * x_clamp 2025-05-07T20:33:19.9006015Z x0 = x[:, :D] 2025-05-07T20:33:19.9006117Z x1 = x[:, D:] 2025-05-07T20:33:19.9006200Z 2025-05-07T20:33:19.9006295Z if contiguous: 2025-05-07T20:33:19.9006405Z x0 = x0.contiguous() 2025-05-07T20:33:19.9006510Z x1 = x1.contiguous() 2025-05-07T20:33:19.9006593Z 2025-05-07T20:33:19.9006702Z if scale_ub is not None: 2025-05-07T20:33:19.9006827Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9006980Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9007072Z ) 2025-05-07T20:33:19.9007158Z else: 2025-05-07T20:33:19.9007275Z scale_ub_tensor = None 2025-05-07T20:33:19.9007358Z 2025-05-07T20:33:19.9007507Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9007616Z op = silu_mul_quant 2025-05-07T20:33:19.9007712Z if compiled: 2025-05-07T20:33:19.9007825Z op = torch.compile(op) 2025-05-07T20:33:19.9007952Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9008035Z 2025-05-07T20:33:19.9008138Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9008142Z 2025-05-07T20:33:19.9008260Z moe/activation_test.py:117: 2025-05-07T20:33:19.9008408Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9008530Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9008698Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9009111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.9009224Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.9009784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9009895Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9010304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9010557Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9010945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9011096Z kernel = self.compile( 2025-05-07T20:33:19.9011528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9011734Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9011877Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9011925Z 2025-05-07T20:33:19.9012229Z self = 2025-05-07T20:33:19.9013124Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9013835Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd901288b0>} 2025-05-07T20:33:19.9014884Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9015154Z context = 2025-05-07T20:33:19.9015160Z 2025-05-07T20:33:19.9015401Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9015775Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9015919Z module_map=module_map) 2025-05-07T20:33:19.9016108Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9016224Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9016313Z E ^ 2025-05-07T20:33:19.9016718Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9016725Z 2025-05-07T20:33:19.9017189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9017194Z 2025-05-07T20:33:19.9017319Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9017568Z self=, 2025-05-07T20:33:19.9017660Z T=2048, 2025-05-07T20:33:19.9017754Z D=7168, 2025-05-07T20:33:19.9017853Z scale_ub=1200.0, 2025-05-07T20:33:19.9017953Z contiguous=False, 2025-05-07T20:33:19.9018055Z compiled=True, 2025-05-07T20:33:19.9018139Z ) 2025-05-07T20:33:19.9018389Z self = 2025-05-07T20:33:19.9018588Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:19.9018592Z 2025-05-07T20:33:19.9018681Z @given( 2025-05-07T20:33:19.9018821Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9018938Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9019069Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9019259Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9019392Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9019483Z ) 2025-05-07T20:33:19.9019763Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9019875Z def test_silu_mul_quant( 2025-05-07T20:33:19.9019968Z self, 2025-05-07T20:33:19.9020056Z T: int, 2025-05-07T20:33:19.9020144Z D: int, 2025-05-07T20:33:19.9020264Z scale_ub: Optional[float], 2025-05-07T20:33:19.9020367Z contiguous: bool, 2025-05-07T20:33:19.9020466Z compiled: bool, 2025-05-07T20:33:19.9020565Z ) -> None: 2025-05-07T20:33:19.9020674Z torch.manual_seed(2025) 2025-05-07T20:33:19.9020758Z 2025-05-07T20:33:19.9020957Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9021092Z 2025-05-07T20:33:19.9021198Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9021350Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9021451Z x = x_sign * x_clamp 2025-05-07T20:33:19.9021553Z x0 = x[:, :D] 2025-05-07T20:33:19.9021645Z x1 = x[:, D:] 2025-05-07T20:33:19.9021775Z 2025-05-07T20:33:19.9021879Z if contiguous: 2025-05-07T20:33:19.9022025Z x0 = x0.contiguous() 2025-05-07T20:33:19.9022129Z x1 = x1.contiguous() 2025-05-07T20:33:19.9022222Z 2025-05-07T20:33:19.9022327Z if scale_ub is not None: 2025-05-07T20:33:19.9022447Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9022607Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9022695Z ) 2025-05-07T20:33:19.9022782Z else: 2025-05-07T20:33:19.9022898Z scale_ub_tensor = None 2025-05-07T20:33:19.9022984Z 2025-05-07T20:33:19.9023143Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9023248Z op = silu_mul_quant 2025-05-07T20:33:19.9023348Z if compiled: 2025-05-07T20:33:19.9023476Z op = torch.compile(op) 2025-05-07T20:33:19.9023626Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9023727Z 2025-05-07T20:33:19.9024247Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9024258Z 2025-05-07T20:33:19.9024466Z moe/activation_test.py:117: 2025-05-07T20:33:19.9024723Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9024934Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9025118Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9025551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.9025658Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.9026217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9026334Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9026737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9026989Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9027380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9027487Z kernel = self.compile( 2025-05-07T20:33:19.9027921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9028119Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9028263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9028271Z 2025-05-07T20:33:19.9028508Z self = 2025-05-07T20:33:19.9029566Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9030145Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd90129090>} 2025-05-07T20:33:19.9030977Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9031193Z context = 2025-05-07T20:33:19.9031206Z 2025-05-07T20:33:19.9031394Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9031766Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9031903Z module_map=module_map) 2025-05-07T20:33:19.9032086Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9032198Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9032293Z E ^ 2025-05-07T20:33:19.9032820Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9032826Z 2025-05-07T20:33:19.9033293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9033298Z 2025-05-07T20:33:19.9033414Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9033747Z self=, 2025-05-07T20:33:19.9033845Z T=1, 2025-05-07T20:33:19.9033938Z D=5120, 2025-05-07T20:33:19.9034032Z scale_ub=None, 2025-05-07T20:33:19.9034138Z contiguous=False, 2025-05-07T20:33:19.9034234Z compiled=False, 2025-05-07T20:33:19.9034318Z ) 2025-05-07T20:33:19.9034570Z self = 2025-05-07T20:33:19.9034760Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.9034769Z 2025-05-07T20:33:19.9034862Z @given( 2025-05-07T20:33:19.9034998Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9035113Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9035249Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9035382Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9035512Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9035602Z ) 2025-05-07T20:33:19.9035879Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9035996Z def test_silu_mul_quant( 2025-05-07T20:33:19.9036083Z self, 2025-05-07T20:33:19.9036171Z T: int, 2025-05-07T20:33:19.9036263Z D: int, 2025-05-07T20:33:19.9036377Z scale_ub: Optional[float], 2025-05-07T20:33:19.9036480Z contiguous: bool, 2025-05-07T20:33:19.9036585Z compiled: bool, 2025-05-07T20:33:19.9036675Z ) -> None: 2025-05-07T20:33:19.9036785Z torch.manual_seed(2025) 2025-05-07T20:33:19.9036877Z 2025-05-07T20:33:19.9037071Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9037155Z 2025-05-07T20:33:19.9037267Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9037408Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9037511Z x = x_sign * x_clamp 2025-05-07T20:33:19.9037611Z x0 = x[:, :D] 2025-05-07T20:33:19.9037703Z x1 = x[:, D:] 2025-05-07T20:33:19.9037798Z 2025-05-07T20:33:19.9037894Z if contiguous: 2025-05-07T20:33:19.9038001Z x0 = x0.contiguous() 2025-05-07T20:33:19.9038113Z x1 = x1.contiguous() 2025-05-07T20:33:19.9038197Z 2025-05-07T20:33:19.9038356Z if scale_ub is not None: 2025-05-07T20:33:19.9038484Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9038636Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9038723Z ) 2025-05-07T20:33:19.9038819Z else: 2025-05-07T20:33:19.9038928Z scale_ub_tensor = None 2025-05-07T20:33:19.9039012Z 2025-05-07T20:33:19.9039165Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9039272Z op = silu_mul_quant 2025-05-07T20:33:19.9039374Z if compiled: 2025-05-07T20:33:19.9039492Z op = torch.compile(op) 2025-05-07T20:33:19.9039612Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9039702Z 2025-05-07T20:33:19.9039807Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9039857Z 2025-05-07T20:33:19.9039967Z moe/activation_test.py:117: 2025-05-07T20:33:19.9040118Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9040238Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9040351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9040917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9041115Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9041526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9041776Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9042158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9042270Z kernel = self.compile( 2025-05-07T20:33:19.9042704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9042912Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9043059Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9043064Z 2025-05-07T20:33:19.9043304Z self = 2025-05-07T20:33:19.9044400Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9045116Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd901297e0>} 2025-05-07T20:33:19.9046121Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9046344Z context = 2025-05-07T20:33:19.9046349Z 2025-05-07T20:33:19.9046537Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9046841Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9046970Z module_map=module_map) 2025-05-07T20:33:19.9047162Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9047277Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9047366Z E ^ 2025-05-07T20:33:19.9047773Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9047777Z 2025-05-07T20:33:19.9048246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9048253Z 2025-05-07T20:33:19.9048382Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9048684Z self=, 2025-05-07T20:33:19.9048774Z T=4096, 2025-05-07T20:33:19.9048867Z D=7168, 2025-05-07T20:33:19.9048965Z scale_ub=1200.0, 2025-05-07T20:33:19.9049069Z contiguous=False, 2025-05-07T20:33:19.9049173Z compiled=False, 2025-05-07T20:33:19.9049261Z ) 2025-05-07T20:33:19.9049508Z self = 2025-05-07T20:33:19.9049717Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.9049722Z 2025-05-07T20:33:19.9049811Z @given( 2025-05-07T20:33:19.9049950Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9050064Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9050241Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9050380Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9050510Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9050599Z ) 2025-05-07T20:33:19.9050884Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9050992Z def test_silu_mul_quant( 2025-05-07T20:33:19.9051161Z self, 2025-05-07T20:33:19.9051257Z T: int, 2025-05-07T20:33:19.9051411Z D: int, 2025-05-07T20:33:19.9051525Z scale_ub: Optional[float], 2025-05-07T20:33:19.9051632Z contiguous: bool, 2025-05-07T20:33:19.9051731Z compiled: bool, 2025-05-07T20:33:19.9051831Z ) -> None: 2025-05-07T20:33:19.9051939Z torch.manual_seed(2025) 2025-05-07T20:33:19.9052023Z 2025-05-07T20:33:19.9052226Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9052311Z 2025-05-07T20:33:19.9052421Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9052569Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9052670Z x = x_sign * x_clamp 2025-05-07T20:33:19.9052763Z x0 = x[:, :D] 2025-05-07T20:33:19.9052872Z x1 = x[:, D:] 2025-05-07T20:33:19.9052956Z 2025-05-07T20:33:19.9053053Z if contiguous: 2025-05-07T20:33:19.9053165Z x0 = x0.contiguous() 2025-05-07T20:33:19.9053272Z x1 = x1.contiguous() 2025-05-07T20:33:19.9053365Z 2025-05-07T20:33:19.9053472Z if scale_ub is not None: 2025-05-07T20:33:19.9053593Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9053766Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9053867Z ) 2025-05-07T20:33:19.9053968Z else: 2025-05-07T20:33:19.9054094Z scale_ub_tensor = None 2025-05-07T20:33:19.9054178Z 2025-05-07T20:33:19.9054326Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9054441Z op = silu_mul_quant 2025-05-07T20:33:19.9054538Z if compiled: 2025-05-07T20:33:19.9054653Z op = torch.compile(op) 2025-05-07T20:33:19.9054781Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9054866Z 2025-05-07T20:33:19.9054970Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9054981Z 2025-05-07T20:33:19.9055094Z moe/activation_test.py:117: 2025-05-07T20:33:19.9055246Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9055377Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9055493Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9056057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:19.9056178Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9056585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9056851Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9057292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9057403Z kernel = self.compile( 2025-05-07T20:33:19.9057847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9058052Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9058196Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9058201Z 2025-05-07T20:33:19.9058442Z self = 2025-05-07T20:33:19.9059317Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9059943Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd9012a200>} 2025-05-07T20:33:19.9060782Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9061095Z context = 2025-05-07T20:33:19.9061100Z 2025-05-07T20:33:19.9061289Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9061588Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9061716Z module_map=module_map) 2025-05-07T20:33:19.9061901Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9062016Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9062111Z E ^ 2025-05-07T20:33:19.9062514Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9062519Z 2025-05-07T20:33:19.9062991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9062999Z 2025-05-07T20:33:19.9063123Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9063377Z self=, 2025-05-07T20:33:19.9063474Z T=16384, 2025-05-07T20:33:19.9063564Z D=7168, 2025-05-07T20:33:19.9063661Z scale_ub=None, 2025-05-07T20:33:19.9063766Z contiguous=True, 2025-05-07T20:33:19.9063863Z compiled=True, 2025-05-07T20:33:19.9063953Z ) 2025-05-07T20:33:19.9064199Z self = 2025-05-07T20:33:19.9064398Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:19.9064403Z 2025-05-07T20:33:19.9064498Z @given( 2025-05-07T20:33:19.9064635Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9064750Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9064888Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9065023Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9065164Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9065250Z ) 2025-05-07T20:33:19.9065528Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9065642Z def test_silu_mul_quant( 2025-05-07T20:33:19.9065732Z self, 2025-05-07T20:33:19.9065820Z T: int, 2025-05-07T20:33:19.9065915Z D: int, 2025-05-07T20:33:19.9066029Z scale_ub: Optional[float], 2025-05-07T20:33:19.9066132Z contiguous: bool, 2025-05-07T20:33:19.9066238Z compiled: bool, 2025-05-07T20:33:19.9066327Z ) -> None: 2025-05-07T20:33:19.9066435Z torch.manual_seed(2025) 2025-05-07T20:33:19.9066526Z 2025-05-07T20:33:19.9066770Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9066855Z 2025-05-07T20:33:19.9066965Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9067107Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9067219Z x = x_sign * x_clamp 2025-05-07T20:33:19.9067314Z x0 = x[:, :D] 2025-05-07T20:33:19.9067405Z x1 = x[:, D:] 2025-05-07T20:33:19.9067495Z 2025-05-07T20:33:19.9067591Z if contiguous: 2025-05-07T20:33:19.9067696Z x0 = x0.contiguous() 2025-05-07T20:33:19.9067805Z x1 = x1.contiguous() 2025-05-07T20:33:19.9067888Z 2025-05-07T20:33:19.9067992Z if scale_ub is not None: 2025-05-07T20:33:19.9068120Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9068320Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9068409Z ) 2025-05-07T20:33:19.9068504Z else: 2025-05-07T20:33:19.9068611Z scale_ub_tensor = None 2025-05-07T20:33:19.9068704Z 2025-05-07T20:33:19.9068854Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9068957Z op = silu_mul_quant 2025-05-07T20:33:19.9069062Z if compiled: 2025-05-07T20:33:19.9069223Z op = torch.compile(op) 2025-05-07T20:33:19.9069387Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9069479Z 2025-05-07T20:33:19.9069586Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9069591Z 2025-05-07T20:33:19.9069703Z moe/activation_test.py:117: 2025-05-07T20:33:19.9069859Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9069977Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9070099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9070518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.9070625Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7efd9012b760>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
[... identical Triton traceback and CompilationError as above ...]
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
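Every failure in this run bottoms out in the same ValueError: the _fbgemm_silu_mul_quant kernel asks Triton for the fp8e4nv (FP8 E4M3) element type, and the GPU this job landed on only exposes fp8e4b15 and fp8e5. A minimal guard one might add to such a test is sketched below; the helper name and the compute-capability threshold of (8, 9) are assumptions for illustration, not FBGEMM's actual code.

```python
import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Assumption: Triton lowers fp8e4nv (torch.float8_e4m3fn) only on GPUs with
    # compute capability >= (8, 9); older parts expose just fp8e4b15/fp8e5,
    # which matches the CompilationError reported in this log.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


# Illustrative usage on a test like the one above:
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
# def test_silu_mul_quant(...): ...
```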
[Hypothesis went on to try eleven more examples; each failed with the identical CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9264011Z 2025-05-07T20:33:19.9264493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9264499Z 2025-05-07T20:33:19.9264640Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9264921Z self=, 2025-05-07T20:33:19.9265020Z T=2048, 2025-05-07T20:33:19.9265110Z D=5120, 2025-05-07T20:33:19.9265206Z scale_ub=None, 2025-05-07T20:33:19.9265313Z contiguous=False, 2025-05-07T20:33:19.9265410Z compiled=True, 2025-05-07T20:33:19.9265497Z ) 2025-05-07T20:33:19.9265752Z self = 2025-05-07T20:33:19.9265949Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:19.9265956Z 2025-05-07T20:33:19.9266053Z @given( 2025-05-07T20:33:19.9266188Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9266306Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9266445Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9266580Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9266715Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9271695Z ) 2025-05-07T20:33:19.9272003Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9272114Z def test_silu_mul_quant( 2025-05-07T20:33:19.9272211Z self, 2025-05-07T20:33:19.9272300Z T: int, 2025-05-07T20:33:19.9272399Z D: int, 2025-05-07T20:33:19.9272512Z scale_ub: Optional[float], 2025-05-07T20:33:19.9272616Z contiguous: bool, 2025-05-07T20:33:19.9272722Z compiled: bool, 2025-05-07T20:33:19.9272817Z ) -> None: 2025-05-07T20:33:19.9272930Z torch.manual_seed(2025) 2025-05-07T20:33:19.9273024Z 2025-05-07T20:33:19.9273335Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9273423Z 2025-05-07T20:33:19.9273731Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9273877Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9273983Z x = x_sign * x_clamp 2025-05-07T20:33:19.9274084Z x0 = x[:, :D] 2025-05-07T20:33:19.9274179Z x1 = x[:, D:] 2025-05-07T20:33:19.9274271Z 2025-05-07T20:33:19.9274369Z if contiguous: 2025-05-07T20:33:19.9274474Z x0 = x0.contiguous() 2025-05-07T20:33:19.9274585Z x1 = x1.contiguous() 2025-05-07T20:33:19.9274668Z 2025-05-07T20:33:19.9274775Z if scale_ub is not None: 2025-05-07T20:33:19.9274905Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9275061Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9275208Z ) 2025-05-07T20:33:19.9275305Z else: 2025-05-07T20:33:19.9275415Z scale_ub_tensor = None 2025-05-07T20:33:19.9275501Z 2025-05-07T20:33:19.9275660Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9275768Z op = silu_mul_quant 2025-05-07T20:33:19.9275866Z if compiled: 2025-05-07T20:33:19.9276039Z op = torch.compile(op) 2025-05-07T20:33:19.9276203Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9276293Z 2025-05-07T20:33:19.9276398Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9276404Z 2025-05-07T20:33:19.9276517Z moe/activation_test.py:117: 2025-05-07T20:33:19.9276674Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9276794Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9276914Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9277346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.9277461Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.9278038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9278152Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9278563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9278835Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9279225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9279334Z kernel = self.compile( 2025-05-07T20:33:19.9279777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9279979Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9280133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9280138Z 2025-05-07T20:33:19.9280377Z self = 2025-05-07T20:33:19.9281263Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9281854Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbce43a0>} 2025-05-07T20:33:19.9282698Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9282929Z context = 2025-05-07T20:33:19.9282934Z 2025-05-07T20:33:19.9283173Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9283482Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9283607Z module_map=module_map) 2025-05-07T20:33:19.9283796Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9283917Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9284008Z E ^ 2025-05-07T20:33:19.9284412Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9284417Z 2025-05-07T20:33:19.9284894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9284899Z 2025-05-07T20:33:19.9285020Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9285334Z self=, 2025-05-07T20:33:19.9285424Z T=2048, 2025-05-07T20:33:19.9285511Z D=5120, 2025-05-07T20:33:19.9285615Z scale_ub=1200.0, 2025-05-07T20:33:19.9285715Z contiguous=False, 2025-05-07T20:33:19.9285812Z compiled=True, 2025-05-07T20:33:19.9285903Z ) 2025-05-07T20:33:19.9286151Z self = 2025-05-07T20:33:19.9286443Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:19.9286456Z 2025-05-07T20:33:19.9286547Z @given( 2025-05-07T20:33:19.9286683Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9286805Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9286939Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9287074Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9287213Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9287303Z ) 2025-05-07T20:33:19.9287584Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9287702Z def test_silu_mul_quant( 2025-05-07T20:33:19.9287793Z self, 2025-05-07T20:33:19.9287882Z T: int, 2025-05-07T20:33:19.9287978Z D: int, 2025-05-07T20:33:19.9288093Z scale_ub: Optional[float], 2025-05-07T20:33:19.9288207Z contiguous: bool, 2025-05-07T20:33:19.9288312Z compiled: bool, 2025-05-07T20:33:19.9288407Z ) -> None: 2025-05-07T20:33:19.9288525Z torch.manual_seed(2025) 2025-05-07T20:33:19.9288610Z 2025-05-07T20:33:19.9288804Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9288897Z 2025-05-07T20:33:19.9289004Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9289150Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9289264Z x = x_sign * x_clamp 2025-05-07T20:33:19.9289361Z x0 = x[:, :D] 2025-05-07T20:33:19.9289453Z x1 = x[:, D:] 2025-05-07T20:33:19.9289545Z 2025-05-07T20:33:19.9289641Z if contiguous: 2025-05-07T20:33:19.9289750Z x0 = x0.contiguous() 2025-05-07T20:33:19.9289862Z x1 = x1.contiguous() 2025-05-07T20:33:19.9289946Z 2025-05-07T20:33:19.9290058Z if scale_ub is not None: 2025-05-07T20:33:19.9290184Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9290343Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9290441Z ) 2025-05-07T20:33:19.9290534Z else: 2025-05-07T20:33:19.9290645Z scale_ub_tensor = None 2025-05-07T20:33:19.9290737Z 2025-05-07T20:33:19.9290887Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9290995Z op = silu_mul_quant 2025-05-07T20:33:19.9291101Z if compiled: 2025-05-07T20:33:19.9291217Z op = torch.compile(op) 2025-05-07T20:33:19.9291343Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9291434Z 2025-05-07T20:33:19.9291540Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9291544Z 2025-05-07T20:33:19.9291718Z moe/activation_test.py:117: 2025-05-07T20:33:19.9291869Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9291988Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9292113Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9292532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.9292641Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.9293212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9293326Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9293741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9294045Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9294434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9294548Z kernel = self.compile( 2025-05-07T20:33:19.9294984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9295278Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9295424Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9295430Z 2025-05-07T20:33:19.9295669Z self = 2025-05-07T20:33:19.9296559Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9297145Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbce4820>} 2025-05-07T20:33:19.9297997Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9298224Z context = 2025-05-07T20:33:19.9298229Z 2025-05-07T20:33:19.9298420Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9298730Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9298856Z module_map=module_map) 2025-05-07T20:33:19.9299048Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9299165Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9299256Z E ^ 2025-05-07T20:33:19.9299670Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9299675Z 2025-05-07T20:33:19.9300146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9300154Z 2025-05-07T20:33:19.9300284Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9300538Z self=, 2025-05-07T20:33:19.9300628Z T=4096, 2025-05-07T20:33:19.9300722Z D=5120, 2025-05-07T20:33:19.9300818Z scale_ub=1200.0, 2025-05-07T20:33:19.9300916Z contiguous=True, 2025-05-07T20:33:19.9301023Z compiled=True, 2025-05-07T20:33:19.9301108Z ) 2025-05-07T20:33:19.9301359Z self = 2025-05-07T20:33:19.9301567Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.9301572Z 2025-05-07T20:33:19.9301662Z @given( 2025-05-07T20:33:19.9301848Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9301974Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9302107Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9302250Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9302385Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9302472Z ) 2025-05-07T20:33:19.9302760Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9302868Z def test_silu_mul_quant( 2025-05-07T20:33:19.9302957Z self, 2025-05-07T20:33:19.9303052Z T: int, 2025-05-07T20:33:19.9303140Z D: int, 2025-05-07T20:33:19.9303254Z scale_ub: Optional[float], 2025-05-07T20:33:19.9303364Z contiguous: bool, 2025-05-07T20:33:19.9303511Z compiled: bool, 2025-05-07T20:33:19.9303610Z ) -> None: 2025-05-07T20:33:19.9303720Z torch.manual_seed(2025) 2025-05-07T20:33:19.9303804Z 2025-05-07T20:33:19.9304009Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9304095Z 2025-05-07T20:33:19.9304203Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9304402Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9304575Z x = x_sign * x_clamp 2025-05-07T20:33:19.9304670Z x0 = x[:, :D] 2025-05-07T20:33:19.9304770Z x1 = x[:, D:] 2025-05-07T20:33:19.9304854Z 2025-05-07T20:33:19.9304952Z if contiguous: 2025-05-07T20:33:19.9305071Z x0 = x0.contiguous() 2025-05-07T20:33:19.9305175Z x1 = x1.contiguous() 2025-05-07T20:33:19.9305262Z 2025-05-07T20:33:19.9305374Z if scale_ub is not None: 2025-05-07T20:33:19.9305496Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9305661Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9305750Z ) 2025-05-07T20:33:19.9305839Z else: 2025-05-07T20:33:19.9305958Z scale_ub_tensor = None 2025-05-07T20:33:19.9306044Z 2025-05-07T20:33:19.9306197Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9306309Z op = silu_mul_quant 2025-05-07T20:33:19.9306412Z if compiled: 2025-05-07T20:33:19.9306531Z op = torch.compile(op) 2025-05-07T20:33:19.9306661Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9306747Z 2025-05-07T20:33:19.9306853Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9306864Z 2025-05-07T20:33:19.9306977Z moe/activation_test.py:117: 2025-05-07T20:33:19.9307125Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9307249Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9307366Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9307792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.9307909Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.9308474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9308589Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9309011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9309268Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9309663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9309772Z kernel = self.compile( 2025-05-07T20:33:19.9310212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9310428Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9310629Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9310635Z 2025-05-07T20:33:19.9310881Z self = 2025-05-07T20:33:19.9311763Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9312347Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbce5360>} 2025-05-07T20:33:19.9313199Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9313468Z context = 2025-05-07T20:33:19.9313473Z 2025-05-07T20:33:19.9313773Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9314076Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9314247Z module_map=module_map) 2025-05-07T20:33:19.9314487Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9314607Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9314706Z E ^ 2025-05-07T20:33:19.9315112Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9315117Z 2025-05-07T20:33:19.9315588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9315596Z 2025-05-07T20:33:19.9315723Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9315977Z self=, 2025-05-07T20:33:19.9316078Z T=128, 2025-05-07T20:33:19.9316168Z D=5120, 2025-05-07T20:33:19.9316264Z scale_ub=1200.0, 2025-05-07T20:33:19.9316373Z contiguous=False, 2025-05-07T20:33:19.9316470Z compiled=True, 2025-05-07T20:33:19.9316559Z ) 2025-05-07T20:33:19.9316817Z self = 2025-05-07T20:33:19.9317015Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:19.9317019Z 2025-05-07T20:33:19.9317111Z @given( 2025-05-07T20:33:19.9317253Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9317368Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9317506Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9317641Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9317776Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9317868Z ) 2025-05-07T20:33:19.9318153Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9318262Z def test_silu_mul_quant( 2025-05-07T20:33:19.9318361Z self, 2025-05-07T20:33:19.9318449Z T: int, 2025-05-07T20:33:19.9318537Z D: int, 2025-05-07T20:33:19.9318666Z scale_ub: Optional[float], 2025-05-07T20:33:19.9318775Z contiguous: bool, 2025-05-07T20:33:19.9318874Z compiled: bool, 2025-05-07T20:33:19.9318972Z ) -> None: 2025-05-07T20:33:19.9319080Z torch.manual_seed(2025) 2025-05-07T20:33:19.9319170Z 2025-05-07T20:33:19.9319366Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9319452Z 2025-05-07T20:33:19.9319563Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9319707Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9319815Z x = x_sign * x_clamp 2025-05-07T20:33:19.9319915Z x0 = x[:, :D] 2025-05-07T20:33:19.9320008Z x1 = x[:, D:] 2025-05-07T20:33:19.9320091Z 2025-05-07T20:33:19.9320251Z if contiguous: 2025-05-07T20:33:19.9320359Z x0 = x0.contiguous() 2025-05-07T20:33:19.9320462Z x1 = x1.contiguous() 2025-05-07T20:33:19.9320553Z 2025-05-07T20:33:19.9320663Z if scale_ub is not None: 2025-05-07T20:33:19.9320791Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9320952Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9321041Z ) 2025-05-07T20:33:19.9321140Z else: 2025-05-07T20:33:19.9321249Z scale_ub_tensor = None 2025-05-07T20:33:19.9321335Z 2025-05-07T20:33:19.9321492Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9321598Z op = silu_mul_quant 2025-05-07T20:33:19.9321696Z if compiled: 2025-05-07T20:33:19.9321868Z op = torch.compile(op) 2025-05-07T20:33:19.9321991Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9322077Z 2025-05-07T20:33:19.9322193Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9322198Z 2025-05-07T20:33:19.9322310Z moe/activation_test.py:117: 2025-05-07T20:33:19.9322465Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9322630Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9322789Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9323216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.9323324Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.9324401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9324555Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9324968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9325234Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9325622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9325732Z kernel = self.compile( 2025-05-07T20:33:19.9326184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9326390Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9326536Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9326548Z 2025-05-07T20:33:19.9326787Z self = 2025-05-07T20:33:19.9327671Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9328266Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbce6290>} 2025-05-07T20:33:19.9329115Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9329351Z context = 2025-05-07T20:33:19.9329356Z 2025-05-07T20:33:19.9329546Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9329851Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9329986Z module_map=module_map) 2025-05-07T20:33:19.9330175Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9330292Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9330389Z E ^ 2025-05-07T20:33:19.9330989Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9330996Z 2025-05-07T20:33:19.9331478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9331485Z 2025-05-07T20:33:19.9331605Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9331859Z self=, 2025-05-07T20:33:19.9331958Z T=16384, 2025-05-07T20:33:19.9332048Z D=7168, 2025-05-07T20:33:19.9332151Z scale_ub=1200.0, 2025-05-07T20:33:19.9332248Z contiguous=True, 2025-05-07T20:33:19.9332348Z compiled=True, 2025-05-07T20:33:19.9332439Z ) 2025-05-07T20:33:19.9332762Z self = 2025-05-07T20:33:19.9332964Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.9332973Z 2025-05-07T20:33:19.9333068Z @given( 2025-05-07T20:33:19.9333206Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9333320Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9333533Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9333756Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9333920Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9334007Z ) 2025-05-07T20:33:19.9334290Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9334408Z def test_silu_mul_quant( 2025-05-07T20:33:19.9334497Z self, 2025-05-07T20:33:19.9334585Z T: int, 2025-05-07T20:33:19.9334680Z D: int, 2025-05-07T20:33:19.9334801Z scale_ub: Optional[float], 2025-05-07T20:33:19.9334904Z contiguous: bool, 2025-05-07T20:33:19.9335014Z compiled: bool, 2025-05-07T20:33:19.9335106Z ) -> None: 2025-05-07T20:33:19.9335218Z torch.manual_seed(2025) 2025-05-07T20:33:19.9335309Z 2025-05-07T20:33:19.9335503Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9335596Z 2025-05-07T20:33:19.9335706Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9335855Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9335967Z x = x_sign * x_clamp 2025-05-07T20:33:19.9336060Z x0 = x[:, :D] 2025-05-07T20:33:19.9336152Z x1 = x[:, D:] 2025-05-07T20:33:19.9336243Z 2025-05-07T20:33:19.9336339Z if contiguous: 2025-05-07T20:33:19.9336446Z x0 = x0.contiguous() 2025-05-07T20:33:19.9336554Z x1 = x1.contiguous() 2025-05-07T20:33:19.9336640Z 2025-05-07T20:33:19.9336744Z if scale_ub is not None: 2025-05-07T20:33:19.9336875Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9337034Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9337120Z ) 2025-05-07T20:33:19.9337219Z else: 2025-05-07T20:33:19.9337328Z scale_ub_tensor = None 2025-05-07T20:33:19.9337418Z 2025-05-07T20:33:19.9337567Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9337676Z op = silu_mul_quant 2025-05-07T20:33:19.9337784Z if compiled: 2025-05-07T20:33:19.9337900Z op = torch.compile(op) 2025-05-07T20:33:19.9338023Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9338114Z 2025-05-07T20:33:19.9338219Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9338224Z 2025-05-07T20:33:19.9338335Z moe/activation_test.py:117: 2025-05-07T20:33:19.9338495Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9338612Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9338735Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9339210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.9339319Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.9339906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9340027Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9340436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9340698Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9341087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9341204Z kernel = self.compile( 2025-05-07T20:33:19.9341640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9341890Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9342049Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9342054Z 2025-05-07T20:33:19.9342291Z self = 2025-05-07T20:33:19.9343300Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9343885Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbce6d40>} 2025-05-07T20:33:19.9344730Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9344967Z context = 2025-05-07T20:33:19.9344972Z 2025-05-07T20:33:19.9345162Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9345473Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9345605Z module_map=module_map) 2025-05-07T20:33:19.9345793Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9345915Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9346006Z E ^ 2025-05-07T20:33:19.9346420Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9346425Z 2025-05-07T20:33:19.9346896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9346903Z 2025-05-07T20:33:19.9347026Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9347291Z self=, 2025-05-07T20:33:19.9347382Z T=16384, 2025-05-07T20:33:19.9347470Z D=5120, 2025-05-07T20:33:19.9347575Z scale_ub=1200.0, 2025-05-07T20:33:19.9347676Z contiguous=True, 2025-05-07T20:33:19.9347785Z compiled=False, 2025-05-07T20:33:19.9347875Z ) 2025-05-07T20:33:19.9348124Z self = 2025-05-07T20:33:19.9348333Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.9348338Z 2025-05-07T20:33:19.9348427Z @given( 2025-05-07T20:33:19.9348564Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9348690Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9348824Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9348961Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9349101Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9349240Z ) 2025-05-07T20:33:19.9349530Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9349639Z def test_silu_mul_quant( 2025-05-07T20:33:19.9349727Z self, 2025-05-07T20:33:19.9349823Z T: int, 2025-05-07T20:33:19.9349912Z D: int, 2025-05-07T20:33:19.9350028Z scale_ub: Optional[float], 2025-05-07T20:33:19.9350138Z contiguous: bool, 2025-05-07T20:33:19.9350241Z compiled: bool, 2025-05-07T20:33:19.9350334Z ) -> None: 2025-05-07T20:33:19.9350449Z torch.manual_seed(2025) 2025-05-07T20:33:19.9350536Z 2025-05-07T20:33:19.9350729Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9350822Z 2025-05-07T20:33:19.9350929Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9351130Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9351234Z x = x_sign * x_clamp 2025-05-07T20:33:19.9351327Z x0 = x[:, :D] 2025-05-07T20:33:19.9351431Z x1 = x[:, D:] 2025-05-07T20:33:19.9351515Z 2025-05-07T20:33:19.9351616Z if contiguous: 2025-05-07T20:33:19.9351728Z x0 = x0.contiguous() 2025-05-07T20:33:19.9351832Z x1 = x1.contiguous() 2025-05-07T20:33:19.9351962Z 2025-05-07T20:33:19.9352115Z if scale_ub is not None: 2025-05-07T20:33:19.9352238Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9352394Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9352488Z ) 2025-05-07T20:33:19.9352577Z else: 2025-05-07T20:33:19.9352687Z scale_ub_tensor = None 2025-05-07T20:33:19.9352779Z 2025-05-07T20:33:19.9352929Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9353038Z op = silu_mul_quant 2025-05-07T20:33:19.9353141Z if compiled: 2025-05-07T20:33:19.9353256Z op = torch.compile(op) 2025-05-07T20:33:19.9353387Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9353471Z 2025-05-07T20:33:19.9353695Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9353700Z 2025-05-07T20:33:19.9353820Z moe/activation_test.py:117: 2025-05-07T20:33:19.9353975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9354121Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9354269Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9354837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:19.9354958Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9355370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9355630Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9356031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9356139Z kernel = self.compile( 2025-05-07T20:33:19.9356583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9356793Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9356938Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9356943Z 2025-05-07T20:33:19.9357186Z self = 2025-05-07T20:33:19.9358067Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9358704Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbce7ac0>} 2025-05-07T20:33:19.9359550Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9359779Z context = 2025-05-07T20:33:19.9359784Z 2025-05-07T20:33:19.9359984Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9360287Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9360419Z module_map=module_map) 2025-05-07T20:33:19.9360605Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9360766Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9360866Z E ^ 2025-05-07T20:33:19.9361275Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9361279Z 2025-05-07T20:33:19.9361747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9361806Z 2025-05-07T20:33:19.9361928Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9362226Z self=, 2025-05-07T20:33:19.9362327Z T=1, 2025-05-07T20:33:19.9362416Z D=7168, 2025-05-07T20:33:19.9362512Z scale_ub=1200.0, 2025-05-07T20:33:19.9362624Z contiguous=False, 2025-05-07T20:33:19.9362722Z compiled=False, 2025-05-07T20:33:19.9362811Z ) 2025-05-07T20:33:19.9363069Z self = 2025-05-07T20:33:19.9363262Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.9363271Z 2025-05-07T20:33:19.9363370Z @given( 2025-05-07T20:33:19.9363513Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9363652Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9363810Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9363953Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9364089Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9364185Z ) 2025-05-07T20:33:19.9364467Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9364577Z def test_silu_mul_quant( 2025-05-07T20:33:19.9364674Z self, 2025-05-07T20:33:19.9364764Z T: int, 2025-05-07T20:33:19.9364854Z D: int, 2025-05-07T20:33:19.9364976Z scale_ub: Optional[float], 2025-05-07T20:33:19.9365081Z contiguous: bool, 2025-05-07T20:33:19.9365188Z compiled: bool, 2025-05-07T20:33:19.9365283Z ) -> None: 2025-05-07T20:33:19.9365394Z torch.manual_seed(2025) 2025-05-07T20:33:19.9365485Z 2025-05-07T20:33:19.9365681Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9365767Z 2025-05-07T20:33:19.9365880Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9366023Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9366132Z x = x_sign * x_clamp 2025-05-07T20:33:19.9366237Z x0 = x[:, :D] 2025-05-07T20:33:19.9366329Z x1 = x[:, D:] 2025-05-07T20:33:19.9366413Z 2025-05-07T20:33:19.9366515Z if contiguous: 2025-05-07T20:33:19.9366621Z x0 = x0.contiguous() 2025-05-07T20:33:19.9366731Z x1 = x1.contiguous() 2025-05-07T20:33:19.9366815Z 2025-05-07T20:33:19.9366919Z if scale_ub is not None: 2025-05-07T20:33:19.9367047Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9367203Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9367294Z ) 2025-05-07T20:33:19.9367390Z else: 2025-05-07T20:33:19.9367498Z scale_ub_tensor = None 2025-05-07T20:33:19.9367638Z 2025-05-07T20:33:19.9367798Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9367903Z op = silu_mul_quant 2025-05-07T20:33:19.9368001Z if compiled: 2025-05-07T20:33:19.9368129Z op = torch.compile(op) 2025-05-07T20:33:19.9368253Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9368344Z 2025-05-07T20:33:19.9368449Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9368454Z 2025-05-07T20:33:19.9368566Z moe/activation_test.py:117: 2025-05-07T20:33:19.9368718Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9368834Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9368950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9369526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9369687Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9370104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9370360Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9370846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9370962Z kernel = self.compile( 2025-05-07T20:33:19.9371398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9371602Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9371753Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9371761Z 2025-05-07T20:33:19.9371997Z self = 2025-05-07T20:33:19.9372879Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9373463Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbbdc4c0>} 2025-05-07T20:33:19.9374369Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9374593Z context = 2025-05-07T20:33:19.9374598Z 2025-05-07T20:33:19.9374787Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9375095Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9375220Z module_map=module_map) 2025-05-07T20:33:19.9375406Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9375527Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9375616Z E ^ 2025-05-07T20:33:19.9376034Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9376039Z 2025-05-07T20:33:19.9376510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9376515Z 2025-05-07T20:33:19.9376635Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9376898Z self=, 2025-05-07T20:33:19.9376988Z T=4096, 2025-05-07T20:33:19.9377085Z D=7168, 2025-05-07T20:33:19.9377183Z scale_ub=1200.0, 2025-05-07T20:33:19.9377284Z contiguous=False, 2025-05-07T20:33:19.9377387Z compiled=True, 2025-05-07T20:33:19.9377472Z ) 2025-05-07T20:33:19.9377772Z self = 2025-05-07T20:33:19.9377983Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:19.9377991Z 2025-05-07T20:33:19.9378082Z @given( 2025-05-07T20:33:19.9378221Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9378345Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9378481Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9378628Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9378760Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9378847Z ) 2025-05-07T20:33:19.9379137Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9379331Z def test_silu_mul_quant( 2025-05-07T20:33:19.9379422Z self, 2025-05-07T20:33:19.9379520Z T: int, 2025-05-07T20:33:19.9379611Z D: int, 2025-05-07T20:33:19.9379727Z scale_ub: Optional[float], 2025-05-07T20:33:19.9379837Z contiguous: bool, 2025-05-07T20:33:19.9379939Z compiled: bool, 2025-05-07T20:33:19.9380029Z ) -> None: 2025-05-07T20:33:19.9380194Z torch.manual_seed(2025) 2025-05-07T20:33:19.9380280Z 2025-05-07T20:33:19.9380524Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9380611Z 2025-05-07T20:33:19.9380719Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9380872Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9380975Z x = x_sign * x_clamp 2025-05-07T20:33:19.9381068Z x0 = x[:, :D] 2025-05-07T20:33:19.9381167Z x1 = x[:, D:] 2025-05-07T20:33:19.9381252Z 2025-05-07T20:33:19.9381350Z if contiguous: 2025-05-07T20:33:19.9381469Z x0 = x0.contiguous() 2025-05-07T20:33:19.9381572Z x1 = x1.contiguous() 2025-05-07T20:33:19.9381657Z 2025-05-07T20:33:19.9381772Z if scale_ub is not None: 2025-05-07T20:33:19.9381893Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9382053Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9382147Z ) 2025-05-07T20:33:19.9382239Z else: 2025-05-07T20:33:19.9382357Z scale_ub_tensor = None 2025-05-07T20:33:19.9382443Z 2025-05-07T20:33:19.9382592Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9382702Z op = silu_mul_quant 2025-05-07T20:33:19.9382800Z if compiled: 2025-05-07T20:33:19.9382915Z op = torch.compile(op) 2025-05-07T20:33:19.9383044Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9383128Z 2025-05-07T20:33:19.9383232Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9383240Z 2025-05-07T20:33:19.9383361Z moe/activation_test.py:117: 2025-05-07T20:33:19.9383508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9383636Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9383752Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9384173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.9384292Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.9384857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9384970Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9385383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9385639Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9386038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9386147Z kernel = self.compile( 2025-05-07T20:33:19.9386637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9386847Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9386994Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9387002Z 2025-05-07T20:33:19.9387244Z self = 2025-05-07T20:33:19.9388126Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9388707Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbbdd1b0>} 2025-05-07T20:33:19.9389604Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9389825Z context = 2025-05-07T20:33:19.9389872Z 2025-05-07T20:33:19.9390107Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9390410Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9390534Z module_map=module_map) 2025-05-07T20:33:19.9390726Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9390840Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9390929Z E ^ 2025-05-07T20:33:19.9391340Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9391348Z 2025-05-07T20:33:19.9391821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9391826Z 2025-05-07T20:33:19.9391953Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9392207Z self=, 2025-05-07T20:33:19.9392300Z T=128, 2025-05-07T20:33:19.9392399Z D=7168, 2025-05-07T20:33:19.9392498Z scale_ub=1200.0, 2025-05-07T20:33:19.9392605Z contiguous=False, 2025-05-07T20:33:19.9392702Z compiled=True, 2025-05-07T20:33:19.9392787Z ) 2025-05-07T20:33:19.9393041Z self = 2025-05-07T20:33:19.9393240Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:19.9393245Z 2025-05-07T20:33:19.9393334Z @given( 2025-05-07T20:33:19.9393479Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9393701Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9393837Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9393981Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9394119Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9394213Z ) 2025-05-07T20:33:19.9394504Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9394615Z def test_silu_mul_quant( 2025-05-07T20:33:19.9394714Z self, 2025-05-07T20:33:19.9394803Z T: int, 2025-05-07T20:33:19.9394892Z D: int, 2025-05-07T20:33:19.9395014Z scale_ub: Optional[float], 2025-05-07T20:33:19.9395120Z contiguous: bool, 2025-05-07T20:33:19.9395221Z compiled: bool, 2025-05-07T20:33:19.9395321Z ) -> None: 2025-05-07T20:33:19.9395431Z torch.manual_seed(2025) 2025-05-07T20:33:19.9395519Z 2025-05-07T20:33:19.9395721Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9395807Z 2025-05-07T20:33:19.9395968Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9396125Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9396228Z x = x_sign * x_clamp 2025-05-07T20:33:19.9396328Z x0 = x[:, :D] 2025-05-07T20:33:19.9396424Z x1 = x[:, D:] 2025-05-07T20:33:19.9396509Z 2025-05-07T20:33:19.9396615Z if contiguous: 2025-05-07T20:33:19.9396720Z x0 = x0.contiguous() 2025-05-07T20:33:19.9396823Z x1 = x1.contiguous() 2025-05-07T20:33:19.9396915Z 2025-05-07T20:33:19.9397020Z if scale_ub is not None: 2025-05-07T20:33:19.9397142Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9397306Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9397398Z ) 2025-05-07T20:33:19.9397486Z else: 2025-05-07T20:33:19.9397646Z scale_ub_tensor = None 2025-05-07T20:33:19.9397732Z 2025-05-07T20:33:19.9397893Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9398001Z op = silu_mul_quant 2025-05-07T20:33:19.9398103Z if compiled: 2025-05-07T20:33:19.9398226Z op = torch.compile(op) 2025-05-07T20:33:19.9398348Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9398478Z 2025-05-07T20:33:19.9398630Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9398636Z 2025-05-07T20:33:19.9398752Z moe/activation_test.py:117: 2025-05-07T20:33:19.9398898Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9399020Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9399135Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9399558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.9399671Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbbdc0d0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self =
T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
    (same Triton compile traceback and CompilationError as above: fp8e4nv not supported in this architecture)

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
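Both of these compilation failures are the same architecture mismatch rather than a kernel bug: Triton's fp8e4nv corresponds to float8_e4m3fn, which the NVIDIA backend only accepts on compute capability 8.9 or newer, while the 22 GiB GPU on this runner only advertises fp8e4b15 and fp8e5 (consistent with an SM 8.6 part such as an A10G). A minimal guard of the kind such suites use, sketched here with a hypothetical SM89_OR_LATER helper (not FBGEMM's actual guard):

    # Hedged sketch: skip FP8-e4m3 kernel tests on pre-SM89 GPUs.
    # SM89_OR_LATER is an assumed name, not FBGEMM's real helper.
    import unittest

    import torch

    SM89_OR_LATER = torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(
        SM89_OR_LATER,
        "fp8e4nv (float8_e4m3fn) requires compute capability 8.9+",
    )
    class SiluMulQuantFP8Test(unittest.TestCase):
        ...

With a guard like this, the examples below would be skipped up front instead of failing one by one at Triton compile time.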
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self =
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

    (same test body as above)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (torch.clamp; tried to allocate 112.00 MiB with only 28.44 MiB free)
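The out-of-memory half of the run points at accumulation across Hypothesis examples rather than a single oversized tensor: over the rest of the run, requests as small as 40.00 MiB fail while reported free memory sits between 140.44 MiB and 26.44 MiB and PyTorch already holds roughly 21.7 GiB, and the error text itself suggests the allocator's expandable-segments mode. A sketch of how that suggestion would be applied, assuming control of the test process (the variable must be set before the first CUDA allocation):

    # Hedged sketch, not part of this workflow: enable expandable segments
    # to curb fragmentation, and release cached blocks between examples.
    import gc
    import os

    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after the env var so the CUDA allocator honors it

    def release_cuda_memory() -> None:
        gc.collect()               # drop Python references to dead tensors first
        torch.cuda.empty_cache()   # then return cached, unused blocks to the driver

Calling release_cuda_memory() at the start of each example would keep one failed or leaky example from starving the ones after it, though the underlying question is what is pinning ~21.7 GiB between examples in the first place.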
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9441583Z 2025-05-07T20:33:19.9441723Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:19.9441727Z 2025-05-07T20:33:19.9441845Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9442105Z self=, 2025-05-07T20:33:19.9442239Z T=16384, 2025-05-07T20:33:19.9442329Z D=7168, 2025-05-07T20:33:19.9442433Z scale_ub=None, 2025-05-07T20:33:19.9442532Z contiguous=False, 2025-05-07T20:33:19.9442634Z compiled=False, 2025-05-07T20:33:19.9442721Z ) 2025-05-07T20:33:19.9442965Z self = 2025-05-07T20:33:19.9443171Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.9443221Z 2025-05-07T20:33:19.9443316Z @given( 2025-05-07T20:33:19.9443495Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9443616Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9443747Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9443881Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9444017Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9444105Z ) 2025-05-07T20:33:19.9444390Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9444502Z def test_silu_mul_quant( 2025-05-07T20:33:19.9444590Z self, 2025-05-07T20:33:19.9444684Z T: int, 2025-05-07T20:33:19.9444775Z D: int, 2025-05-07T20:33:19.9444888Z scale_ub: Optional[float], 2025-05-07T20:33:19.9444999Z contiguous: bool, 2025-05-07T20:33:19.9445097Z compiled: bool, 2025-05-07T20:33:19.9445187Z ) -> None: 2025-05-07T20:33:19.9445304Z torch.manual_seed(2025) 2025-05-07T20:33:19.9445392Z 2025-05-07T20:33:19.9445584Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9447612Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9447622Z 2025-05-07T20:33:19.9447762Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9447767Z 2025-05-07T20:33:19.9447885Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9448140Z self=, 2025-05-07T20:33:19.9448237Z T=2048, 2025-05-07T20:33:19.9448325Z D=7168, 2025-05-07T20:33:19.9448419Z scale_ub=1200.0, 2025-05-07T20:33:19.9448522Z contiguous=True, 2025-05-07T20:33:19.9448619Z compiled=True, 2025-05-07T20:33:19.9448702Z ) 2025-05-07T20:33:19.9448952Z self = 2025-05-07T20:33:19.9449145Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.9449342Z 2025-05-07T20:33:19.9449435Z @given( 2025-05-07T20:33:19.9449569Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9449682Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9449874Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9450009Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9450139Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9450234Z ) 2025-05-07T20:33:19.9450516Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9450623Z def test_silu_mul_quant( 2025-05-07T20:33:19.9450717Z self, 2025-05-07T20:33:19.9450805Z T: int, 2025-05-07T20:33:19.9450900Z D: int, 2025-05-07T20:33:19.9451013Z scale_ub: Optional[float], 2025-05-07T20:33:19.9451115Z contiguous: bool, 2025-05-07T20:33:19.9451219Z compiled: bool, 2025-05-07T20:33:19.9451309Z ) -> None: 2025-05-07T20:33:19.9451467Z torch.manual_seed(2025) 2025-05-07T20:33:19.9451556Z 2025-05-07T20:33:19.9451751Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9451838Z 2025-05-07T20:33:19.9451955Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9452098Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9454441Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9454503Z 2025-05-07T20:33:19.9454672Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:19.9454682Z 2025-05-07T20:33:19.9454838Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9455155Z self=, 2025-05-07T20:33:19.9455266Z T=2048, 2025-05-07T20:33:19.9455384Z D=7168, 2025-05-07T20:33:19.9455504Z scale_ub=None, 2025-05-07T20:33:19.9455627Z contiguous=True, 2025-05-07T20:33:19.9455762Z compiled=False, 2025-05-07T20:33:19.9455865Z ) 2025-05-07T20:33:19.9456113Z self = 2025-05-07T20:33:19.9456315Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.9456320Z 2025-05-07T20:33:19.9456408Z @given( 2025-05-07T20:33:19.9456549Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9456662Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9456793Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9456935Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9457066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9457151Z ) 2025-05-07T20:33:19.9457442Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9457554Z def test_silu_mul_quant( 2025-05-07T20:33:19.9457646Z self, 2025-05-07T20:33:19.9457742Z T: int, 2025-05-07T20:33:19.9457834Z D: int, 2025-05-07T20:33:19.9457951Z scale_ub: Optional[float], 2025-05-07T20:33:19.9458062Z contiguous: bool, 2025-05-07T20:33:19.9458160Z compiled: bool, 2025-05-07T20:33:19.9458255Z ) -> None: 2025-05-07T20:33:19.9458365Z torch.manual_seed(2025) 2025-05-07T20:33:19.9458449Z 2025-05-07T20:33:19.9458648Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9458736Z 2025-05-07T20:33:19.9458843Z > x_sign = torch.sign(x) 2025-05-07T20:33:19.9460899Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9460911Z 2025-05-07T20:33:19.9461048Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:19.9461053Z 2025-05-07T20:33:19.9461177Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9461427Z self=, 2025-05-07T20:33:19.9461515Z T=1, 2025-05-07T20:33:19.9461610Z D=7168, 2025-05-07T20:33:19.9461705Z scale_ub=1200.0, 2025-05-07T20:33:19.9461857Z contiguous=True, 2025-05-07T20:33:19.9461954Z compiled=False, 2025-05-07T20:33:19.9462038Z ) 2025-05-07T20:33:19.9462292Z self = 2025-05-07T20:33:19.9462480Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.9462485Z 2025-05-07T20:33:19.9462574Z @given( 2025-05-07T20:33:19.9462716Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9462917Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9463055Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9463227Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9463388Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9463499Z ) 2025-05-07T20:33:19.9463849Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9463982Z def test_silu_mul_quant( 2025-05-07T20:33:19.9464098Z self, 2025-05-07T20:33:19.9464213Z T: int, 2025-05-07T20:33:19.9464322Z D: int, 2025-05-07T20:33:19.9464470Z scale_ub: Optional[float], 2025-05-07T20:33:19.9464601Z contiguous: bool, 2025-05-07T20:33:19.9464723Z compiled: bool, 2025-05-07T20:33:19.9464841Z ) -> None: 2025-05-07T20:33:19.9464976Z torch.manual_seed(2025) 2025-05-07T20:33:19.9465079Z 2025-05-07T20:33:19.9465299Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9465388Z 2025-05-07T20:33:19.9465495Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9465644Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9465746Z x = x_sign * x_clamp 2025-05-07T20:33:19.9465847Z x0 = x[:, :D] 2025-05-07T20:33:19.9465940Z x1 = x[:, D:] 2025-05-07T20:33:19.9466023Z 2025-05-07T20:33:19.9466126Z if contiguous: 2025-05-07T20:33:19.9466232Z x0 = x0.contiguous() 2025-05-07T20:33:19.9466342Z x1 = x1.contiguous() 2025-05-07T20:33:19.9466431Z 2025-05-07T20:33:19.9466536Z if scale_ub is not None: 2025-05-07T20:33:19.9466659Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9466826Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9466915Z ) 2025-05-07T20:33:19.9467006Z else: 2025-05-07T20:33:19.9467121Z scale_ub_tensor = None 2025-05-07T20:33:19.9467209Z 2025-05-07T20:33:19.9467366Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9467470Z op = silu_mul_quant 2025-05-07T20:33:19.9467570Z if compiled: 2025-05-07T20:33:19.9467692Z op = torch.compile(op) 2025-05-07T20:33:19.9467813Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9467898Z 2025-05-07T20:33:19.9468008Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9468013Z 2025-05-07T20:33:19.9468125Z moe/activation_test.py:117: 2025-05-07T20:33:19.9468276Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9468400Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9468574Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9469152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9469265Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9469679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9469942Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9470327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9470439Z kernel = self.compile( 2025-05-07T20:33:19.9470884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9471130Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9471285Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9471290Z 2025-05-07T20:33:19.9471526Z self = 2025-05-07T20:33:19.9472443Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9473072Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacb6d44c0>} 2025-05-07T20:33:19.9474028Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9474267Z context = 2025-05-07T20:33:19.9474272Z 2025-05-07T20:33:19.9474464Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9474770Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9474900Z module_map=module_map) 2025-05-07T20:33:19.9475089Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9475209Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9475299Z E ^ 2025-05-07T20:33:19.9475701Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9475707Z 2025-05-07T20:33:19.9476179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9476187Z 2025-05-07T20:33:19.9476306Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9476565Z self=, 2025-05-07T20:33:19.9476657Z T=128, 2025-05-07T20:33:19.9476750Z D=5120, 2025-05-07T20:33:19.9476854Z scale_ub=None, 2025-05-07T20:33:19.9476953Z contiguous=True, 2025-05-07T20:33:19.9477050Z compiled=False, 2025-05-07T20:33:19.9477149Z ) 2025-05-07T20:33:19.9477397Z self = 2025-05-07T20:33:19.9477591Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.9477606Z 2025-05-07T20:33:19.9477696Z @given( 2025-05-07T20:33:19.9477830Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9477950Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9478081Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9478216Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9478353Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9478440Z ) 2025-05-07T20:33:19.9478771Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9478889Z def test_silu_mul_quant( 2025-05-07T20:33:19.9478978Z self, 2025-05-07T20:33:19.9479066Z T: int, 2025-05-07T20:33:19.9479161Z D: int, 2025-05-07T20:33:19.9479277Z scale_ub: Optional[float], 2025-05-07T20:33:19.9479388Z contiguous: bool, 2025-05-07T20:33:19.9479487Z compiled: bool, 2025-05-07T20:33:19.9479577Z ) -> None: 2025-05-07T20:33:19.9479692Z torch.manual_seed(2025) 2025-05-07T20:33:19.9479775Z 2025-05-07T20:33:19.9479967Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9480058Z 2025-05-07T20:33:19.9480167Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9480309Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9480468Z x = x_sign * x_clamp 2025-05-07T20:33:19.9480560Z x0 = x[:, :D] 2025-05-07T20:33:19.9480652Z x1 = x[:, D:] 2025-05-07T20:33:19.9480741Z 2025-05-07T20:33:19.9480840Z if contiguous: 2025-05-07T20:33:19.9480951Z x0 = x0.contiguous() 2025-05-07T20:33:19.9481058Z x1 = x1.contiguous() 2025-05-07T20:33:19.9481142Z 2025-05-07T20:33:19.9481301Z if scale_ub is not None: 2025-05-07T20:33:19.9481464Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9481621Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9481714Z ) 2025-05-07T20:33:19.9481803Z else: 2025-05-07T20:33:19.9481911Z scale_ub_tensor = None 2025-05-07T20:33:19.9482000Z 2025-05-07T20:33:19.9482147Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9482251Z op = silu_mul_quant 2025-05-07T20:33:19.9482354Z if compiled: 2025-05-07T20:33:19.9482473Z op = torch.compile(op) 2025-05-07T20:33:19.9482593Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9482682Z 2025-05-07T20:33:19.9482789Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9482794Z 2025-05-07T20:33:19.9482913Z moe/activation_test.py:117: 2025-05-07T20:33:19.9483060Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9483179Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9483303Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9483869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9483981Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9484397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9484651Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9485043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9485153Z kernel = self.compile( 2025-05-07T20:33:19.9485586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9485793Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9485942Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9485947Z 2025-05-07T20:33:19.9486187Z self = 2025-05-07T20:33:19.9487076Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9487653Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacb6d4940>} 2025-05-07T20:33:19.9488554Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9488776Z context = 2025-05-07T20:33:19.9488784Z 2025-05-07T20:33:19.9488982Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9489287Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9489410Z module_map=module_map) 2025-05-07T20:33:19.9489603Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9489719Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9489815Z E ^ 2025-05-07T20:33:19.9490299Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9490304Z 2025-05-07T20:33:19.9490773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9490778Z 2025-05-07T20:33:19.9490904Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9491242Z self=, 2025-05-07T20:33:19.9491340Z T=128, 2025-05-07T20:33:19.9491429Z D=7168, 2025-05-07T20:33:19.9491523Z scale_ub=None, 2025-05-07T20:33:19.9491628Z contiguous=True, 2025-05-07T20:33:19.9491725Z compiled=False, 2025-05-07T20:33:19.9491810Z ) 2025-05-07T20:33:19.9492061Z self = 2025-05-07T20:33:19.9492255Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.9492264Z 2025-05-07T20:33:19.9492352Z @given( 2025-05-07T20:33:19.9492494Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9492611Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9492745Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9492888Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9493019Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9493115Z ) 2025-05-07T20:33:19.9493397Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9493505Z def test_silu_mul_quant( 2025-05-07T20:33:19.9493603Z self, 2025-05-07T20:33:19.9493691Z T: int, 2025-05-07T20:33:19.9493780Z D: int, 2025-05-07T20:33:19.9493902Z scale_ub: Optional[float], 2025-05-07T20:33:19.9494010Z contiguous: bool, 2025-05-07T20:33:19.9494108Z compiled: bool, 2025-05-07T20:33:19.9494205Z ) -> None: 2025-05-07T20:33:19.9494316Z torch.manual_seed(2025) 2025-05-07T20:33:19.9494399Z 2025-05-07T20:33:19.9494598Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9494683Z 2025-05-07T20:33:19.9494796Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9494941Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9495043Z x = x_sign * x_clamp 2025-05-07T20:33:19.9495141Z x0 = x[:, :D] 2025-05-07T20:33:19.9495236Z x1 = x[:, D:] 2025-05-07T20:33:19.9495322Z 2025-05-07T20:33:19.9495426Z if contiguous: 2025-05-07T20:33:19.9495535Z x0 = x0.contiguous() 2025-05-07T20:33:19.9495637Z x1 = x1.contiguous() 2025-05-07T20:33:19.9495728Z 2025-05-07T20:33:19.9495832Z if scale_ub is not None: 2025-05-07T20:33:19.9495952Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9496116Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9496204Z ) 2025-05-07T20:33:19.9496300Z else: 2025-05-07T20:33:19.9496408Z scale_ub_tensor = None 2025-05-07T20:33:19.9496492Z 2025-05-07T20:33:19.9496697Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9496804Z op = silu_mul_quant 2025-05-07T20:33:19.9496901Z if compiled: 2025-05-07T20:33:19.9497022Z op = torch.compile(op) 2025-05-07T20:33:19.9497143Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9497229Z 2025-05-07T20:33:19.9497344Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9497349Z 2025-05-07T20:33:19.9497460Z moe/activation_test.py:117: 2025-05-07T20:33:19.9497611Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9497728Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9497845Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9498420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9498581Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9498989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9499247Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9499636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9499836Z kernel = self.compile( 2025-05-07T20:33:19.9500270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9500469Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9500622Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9500627Z 2025-05-07T20:33:19.9500865Z self = 2025-05-07T20:33:19.9501749Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9502324Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacb6d5240>} 2025-05-07T20:33:19.9503169Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9503397Z context = 2025-05-07T20:33:19.9503402Z 2025-05-07T20:33:19.9503591Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9503897Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9504024Z module_map=module_map) 2025-05-07T20:33:19.9504211Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9504331Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9504423Z E ^ 2025-05-07T20:33:19.9504824Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9504838Z 2025-05-07T20:33:19.9505306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9505311Z 2025-05-07T20:33:19.9505429Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9505690Z self=, 2025-05-07T20:33:19.9505779Z T=2048, 2025-05-07T20:33:19.9505868Z D=7168, 2025-05-07T20:33:19.9505971Z scale_ub=1200.0, 2025-05-07T20:33:19.9506073Z contiguous=True, 2025-05-07T20:33:19.9506170Z compiled=False, 2025-05-07T20:33:19.9506264Z ) 2025-05-07T20:33:19.9506559Z self = 2025-05-07T20:33:19.9506766Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.9506771Z 2025-05-07T20:33:19.9506859Z @given( 2025-05-07T20:33:19.9506993Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9507123Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9507255Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9507390Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9507530Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9507619Z ) 2025-05-07T20:33:19.9507900Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9508016Z def test_silu_mul_quant( 2025-05-07T20:33:19.9508151Z self, 2025-05-07T20:33:19.9508246Z T: int, 2025-05-07T20:33:19.9508336Z D: int, 2025-05-07T20:33:19.9508451Z scale_ub: Optional[float], 2025-05-07T20:33:19.9508564Z contiguous: bool, 2025-05-07T20:33:19.9508668Z compiled: bool, 2025-05-07T20:33:19.9508761Z ) -> None: 2025-05-07T20:33:19.9508876Z torch.manual_seed(2025) 2025-05-07T20:33:19.9508961Z 2025-05-07T20:33:19.9509197Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9511247Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9511257Z 2025-05-07T20:33:19.9511393Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9511401Z 2025-05-07T20:33:19.9511523Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9511774Z self=, 2025-05-07T20:33:19.9511875Z T=1, 2025-05-07T20:33:19.9511965Z D=5120, 2025-05-07T20:33:19.9512064Z scale_ub=1200.0, 2025-05-07T20:33:19.9512174Z contiguous=True, 2025-05-07T20:33:19.9512272Z compiled=False, 2025-05-07T20:33:19.9512358Z ) 2025-05-07T20:33:19.9512608Z self = 2025-05-07T20:33:19.9512796Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.9512801Z 2025-05-07T20:33:19.9512888Z @given( 2025-05-07T20:33:19.9513032Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9513180Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9513348Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9513638Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9513803Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9513920Z ) 2025-05-07T20:33:19.9514266Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9514404Z def test_silu_mul_quant( 2025-05-07T20:33:19.9514522Z self, 2025-05-07T20:33:19.9514632Z T: int, 2025-05-07T20:33:19.9514743Z D: int, 2025-05-07T20:33:19.9514889Z scale_ub: Optional[float], 2025-05-07T20:33:19.9515017Z contiguous: bool, 2025-05-07T20:33:19.9515138Z compiled: bool, 2025-05-07T20:33:19.9515242Z ) -> None: 2025-05-07T20:33:19.9515350Z torch.manual_seed(2025) 2025-05-07T20:33:19.9515442Z 2025-05-07T20:33:19.9515634Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9515723Z 2025-05-07T20:33:19.9515833Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9516030Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9516136Z x = x_sign * x_clamp 2025-05-07T20:33:19.9516235Z x0 = x[:, :D] 2025-05-07T20:33:19.9516328Z x1 = x[:, D:] 2025-05-07T20:33:19.9516410Z 2025-05-07T20:33:19.9516514Z if contiguous: 2025-05-07T20:33:19.9516619Z x0 = x0.contiguous() 2025-05-07T20:33:19.9516727Z x1 = x1.contiguous() 2025-05-07T20:33:19.9516817Z 2025-05-07T20:33:19.9516922Z if scale_ub is not None: 2025-05-07T20:33:19.9517042Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9517204Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9517292Z ) 2025-05-07T20:33:19.9517385Z else: 2025-05-07T20:33:19.9517493Z scale_ub_tensor = None 2025-05-07T20:33:19.9517625Z 2025-05-07T20:33:19.9517780Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9517884Z op = silu_mul_quant 2025-05-07T20:33:19.9517981Z if compiled: 2025-05-07T20:33:19.9518107Z op = torch.compile(op) 2025-05-07T20:33:19.9518234Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9518317Z 2025-05-07T20:33:19.9518426Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9518478Z 2025-05-07T20:33:19.9518631Z moe/activation_test.py:117: 2025-05-07T20:33:19.9518787Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9518903Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9519021Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9519595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9519707Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9520118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9520380Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9520767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9520884Z kernel = self.compile( 2025-05-07T20:33:19.9521325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9521526Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9521678Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9521683Z 2025-05-07T20:33:19.9521916Z self = 2025-05-07T20:33:19.9522798Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9523386Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacb6d6200>} 2025-05-07T20:33:19.9524832Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9525081Z context = 2025-05-07T20:33:19.9525087Z 2025-05-07T20:33:19.9525280Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9525585Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9525707Z module_map=module_map) 2025-05-07T20:33:19.9525897Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9526021Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9526112Z E ^ 2025-05-07T20:33:19.9526728Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9526742Z 2025-05-07T20:33:19.9527211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9527218Z 2025-05-07T20:33:19.9527338Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9527595Z self=, 2025-05-07T20:33:19.9527683Z T=2048, 2025-05-07T20:33:19.9527772Z D=5120, 2025-05-07T20:33:19.9527875Z scale_ub=None, 2025-05-07T20:33:19.9527973Z contiguous=True, 2025-05-07T20:33:19.9528073Z compiled=False, 2025-05-07T20:33:19.9528164Z ) 2025-05-07T20:33:19.9528479Z self = 2025-05-07T20:33:19.9528681Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.9528689Z 2025-05-07T20:33:19.9528775Z @given( 2025-05-07T20:33:19.9528909Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9529032Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9529235Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9529438Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9529577Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9529662Z ) 2025-05-07T20:33:19.9529944Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9530052Z def test_silu_mul_quant( 2025-05-07T20:33:19.9530139Z self, 2025-05-07T20:33:19.9530232Z T: int, 2025-05-07T20:33:19.9530319Z D: int, 2025-05-07T20:33:19.9530435Z scale_ub: Optional[float], 2025-05-07T20:33:19.9530543Z contiguous: bool, 2025-05-07T20:33:19.9530640Z compiled: bool, 2025-05-07T20:33:19.9530728Z ) -> None: 2025-05-07T20:33:19.9530847Z torch.manual_seed(2025) 2025-05-07T20:33:19.9530930Z 2025-05-07T20:33:19.9531121Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9531210Z 2025-05-07T20:33:19.9531320Z > x_sign = torch.sign(x) 2025-05-07T20:33:19.9533308Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9533317Z 2025-05-07T20:33:19.9533451Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:19.9533456Z 2025-05-07T20:33:19.9533583Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9533833Z self=, 2025-05-07T20:33:19.9533922Z T=16384, 2025-05-07T20:33:19.9534017Z D=5120, 2025-05-07T20:33:19.9534111Z scale_ub=None, 2025-05-07T20:33:19.9534211Z contiguous=True, 2025-05-07T20:33:19.9534312Z compiled=False, 2025-05-07T20:33:19.9534396Z ) 2025-05-07T20:33:19.9534640Z self = 2025-05-07T20:33:19.9534845Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.9534850Z 2025-05-07T20:33:19.9534937Z @given( 2025-05-07T20:33:19.9535077Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9535192Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9535321Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9535463Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9535642Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9535728Z ) 2025-05-07T20:33:19.9536012Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9536121Z def test_silu_mul_quant( 2025-05-07T20:33:19.9536208Z self, 2025-05-07T20:33:19.9536306Z T: int, 2025-05-07T20:33:19.9536393Z D: int, 2025-05-07T20:33:19.9536505Z scale_ub: Optional[float], 2025-05-07T20:33:19.9536613Z contiguous: bool, 2025-05-07T20:33:19.9536710Z compiled: bool, 2025-05-07T20:33:19.9536805Z ) -> None: 2025-05-07T20:33:19.9536912Z torch.manual_seed(2025) 2025-05-07T20:33:19.9536995Z 2025-05-07T20:33:19.9537191Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9539262Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9539308Z 2025-05-07T20:33:19.9539448Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9539453Z 2025-05-07T20:33:19.9539571Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9539821Z self=, 2025-05-07T20:33:19.9539918Z T=4096, 2025-05-07T20:33:19.9540007Z D=5120, 2025-05-07T20:33:19.9540103Z scale_ub=None, 2025-05-07T20:33:19.9540209Z contiguous=True, 2025-05-07T20:33:19.9540306Z compiled=False, 2025-05-07T20:33:19.9540396Z ) 2025-05-07T20:33:19.9540646Z self = 2025-05-07T20:33:19.9540840Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.9540844Z 2025-05-07T20:33:19.9540939Z @given( 2025-05-07T20:33:19.9541078Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9541197Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9541334Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9541466Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9541595Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9541689Z ) 2025-05-07T20:33:19.9541966Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9542081Z def test_silu_mul_quant( 2025-05-07T20:33:19.9542172Z self, 2025-05-07T20:33:19.9542260Z T: int, 2025-05-07T20:33:19.9542355Z D: int, 2025-05-07T20:33:19.9542468Z scale_ub: Optional[float], 2025-05-07T20:33:19.9542576Z contiguous: bool, 2025-05-07T20:33:19.9542681Z compiled: bool, 2025-05-07T20:33:19.9542771Z ) -> None: 2025-05-07T20:33:19.9542879Z torch.manual_seed(2025) 2025-05-07T20:33:19.9542973Z 2025-05-07T20:33:19.9543165Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9545162Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9545171Z 2025-05-07T20:33:19.9545352Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9545357Z 2025-05-07T20:33:19.9545481Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9545732Z self=, 2025-05-07T20:33:19.9545823Z T=2048, 2025-05-07T20:33:19.9545919Z D=5120, 2025-05-07T20:33:19.9546018Z scale_ub=None, 2025-05-07T20:33:19.9546120Z contiguous=False, 2025-05-07T20:33:19.9546224Z compiled=False, 2025-05-07T20:33:19.9546308Z ) 2025-05-07T20:33:19.9546551Z self = 2025-05-07T20:33:19.9546755Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.9546760Z 2025-05-07T20:33:19.9546848Z @given( 2025-05-07T20:33:19.9546988Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9547148Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9547279Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9547421Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9547551Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9547641Z ) 2025-05-07T20:33:19.9547929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9548123Z def test_silu_mul_quant( 2025-05-07T20:33:19.9548213Z self, 2025-05-07T20:33:19.9548307Z T: int, 2025-05-07T20:33:19.9548394Z D: int, 2025-05-07T20:33:19.9548508Z scale_ub: Optional[float], 2025-05-07T20:33:19.9548615Z contiguous: bool, 2025-05-07T20:33:19.9548712Z compiled: bool, 2025-05-07T20:33:19.9548806Z ) -> None: 2025-05-07T20:33:19.9548914Z torch.manual_seed(2025) 2025-05-07T20:33:19.9549000Z 2025-05-07T20:33:19.9549203Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9551191Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9551201Z 2025-05-07T20:33:19.9551341Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9551346Z 2025-05-07T20:33:19.9551463Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9551711Z self=, 2025-05-07T20:33:19.9551809Z T=4096, 2025-05-07T20:33:19.9551897Z D=7168, 2025-05-07T20:33:19.9551991Z scale_ub=None, 2025-05-07T20:33:19.9552099Z contiguous=True, 2025-05-07T20:33:19.9552194Z compiled=True, 2025-05-07T20:33:19.9552285Z ) 2025-05-07T20:33:19.9552528Z self = 2025-05-07T20:33:19.9552719Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:19.9552726Z 2025-05-07T20:33:19.9552819Z @given( 2025-05-07T20:33:19.9552956Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9553068Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9553202Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9553335Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9553466Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9553645Z ) 2025-05-07T20:33:19.9553923Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9554040Z def test_silu_mul_quant( 2025-05-07T20:33:19.9554130Z self, 2025-05-07T20:33:19.9554218Z T: int, 2025-05-07T20:33:19.9554391Z D: int, 2025-05-07T20:33:19.9554505Z scale_ub: Optional[float], 2025-05-07T20:33:19.9554620Z contiguous: bool, 2025-05-07T20:33:19.9554735Z compiled: bool, 2025-05-07T20:33:19.9554847Z ) -> None: 2025-05-07T20:33:19.9554962Z torch.manual_seed(2025) 2025-05-07T20:33:19.9555051Z 2025-05-07T20:33:19.9555245Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9557263Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9557319Z 2025-05-07T20:33:19.9557455Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9557459Z 2025-05-07T20:33:19.9557584Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9557886Z self=, 2025-05-07T20:33:19.9558041Z T=2048, 2025-05-07T20:33:19.9558136Z D=5120, 2025-05-07T20:33:19.9558232Z scale_ub=1200.0, 2025-05-07T20:33:19.9558332Z contiguous=False, 2025-05-07T20:33:19.9558436Z compiled=False, 2025-05-07T20:33:19.9558520Z ) 2025-05-07T20:33:19.9558763Z self = 2025-05-07T20:33:19.9558967Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.9558975Z 2025-05-07T20:33:19.9559061Z @given( 2025-05-07T20:33:19.9559201Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9564667Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9564833Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9564982Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9565115Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9565208Z ) 2025-05-07T20:33:19.9565505Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9565615Z def test_silu_mul_quant( 2025-05-07T20:33:19.9565706Z self, 2025-05-07T20:33:19.9565805Z T: int, 2025-05-07T20:33:19.9565893Z D: int, 2025-05-07T20:33:19.9566007Z scale_ub: Optional[float], 2025-05-07T20:33:19.9566116Z contiguous: bool, 2025-05-07T20:33:19.9566215Z compiled: bool, 2025-05-07T20:33:19.9566315Z ) -> None: 2025-05-07T20:33:19.9566425Z torch.manual_seed(2025) 2025-05-07T20:33:19.9566512Z 2025-05-07T20:33:19.9566716Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9568786Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9568796Z 2025-05-07T20:33:19.9568941Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9568946Z 2025-05-07T20:33:19.9569064Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9569319Z self=, 2025-05-07T20:33:19.9569420Z T=4096, 2025-05-07T20:33:19.9569510Z D=7168, 2025-05-07T20:33:19.9569607Z scale_ub=1200.0, 2025-05-07T20:33:19.9569792Z contiguous=True, 2025-05-07T20:33:19.9569896Z compiled=False, 2025-05-07T20:33:19.9569988Z ) 2025-05-07T20:33:19.9570238Z self = 2025-05-07T20:33:19.9570435Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.9570443Z 2025-05-07T20:33:19.9570536Z @given( 2025-05-07T20:33:19.9570672Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9570787Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9570924Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9571056Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9571186Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9571276Z ) 2025-05-07T20:33:19.9571613Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9571728Z def test_silu_mul_quant( 2025-05-07T20:33:19.9571816Z self, 2025-05-07T20:33:19.9571908Z T: int, 2025-05-07T20:33:19.9572001Z D: int, 2025-05-07T20:33:19.9572114Z scale_ub: Optional[float], 2025-05-07T20:33:19.9572218Z contiguous: bool, 2025-05-07T20:33:19.9572371Z compiled: bool, 2025-05-07T20:33:19.9572464Z ) -> None: 2025-05-07T20:33:19.9572615Z torch.manual_seed(2025) 2025-05-07T20:33:19.9572706Z 2025-05-07T20:33:19.9572900Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9574906Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9574916Z 2025-05-07T20:33:19.9575050Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9575056Z 2025-05-07T20:33:19.9575184Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9575439Z self=, 2025-05-07T20:33:19.9575529Z T=16384, 2025-05-07T20:33:19.9575624Z D=7168, 2025-05-07T20:33:19.9575720Z scale_ub=None, 2025-05-07T20:33:19.9575820Z contiguous=False, 2025-05-07T20:33:19.9575921Z compiled=True, 2025-05-07T20:33:19.9576010Z ) 2025-05-07T20:33:19.9576255Z self = 2025-05-07T20:33:19.9576461Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:19.9576469Z 2025-05-07T20:33:19.9576558Z @given( 2025-05-07T20:33:19.9576697Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9576813Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9576943Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9577085Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9577218Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9577307Z ) 2025-05-07T20:33:19.9577592Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9577699Z def test_silu_mul_quant( 2025-05-07T20:33:19.9577787Z self, 2025-05-07T20:33:19.9577881Z T: int, 2025-05-07T20:33:19.9577972Z D: int, 2025-05-07T20:33:19.9578083Z scale_ub: Optional[float], 2025-05-07T20:33:19.9578191Z contiguous: bool, 2025-05-07T20:33:19.9578290Z compiled: bool, 2025-05-07T20:33:19.9578387Z ) -> None: 2025-05-07T20:33:19.9578496Z torch.manual_seed(2025) 2025-05-07T20:33:19.9578582Z 2025-05-07T20:33:19.9578834Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9580826Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9580835Z 2025-05-07T20:33:19.9580978Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9580983Z 2025-05-07T20:33:19.9581102Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9581397Z self=, 2025-05-07T20:33:19.9581497Z T=4096, 2025-05-07T20:33:19.9581586Z D=7168, 2025-05-07T20:33:19.9581683Z scale_ub=None, 2025-05-07T20:33:19.9581791Z contiguous=True, 2025-05-07T20:33:19.9581891Z compiled=False, 2025-05-07T20:33:19.9581985Z ) 2025-05-07T20:33:19.9582232Z self = 2025-05-07T20:33:19.9582514Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.9582520Z 2025-05-07T20:33:19.9582617Z @given( 2025-05-07T20:33:19.9582751Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9582866Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9583003Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9583137Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9583268Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9583366Z ) 2025-05-07T20:33:19.9583646Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9583764Z def test_silu_mul_quant( 2025-05-07T20:33:19.9583854Z self, 2025-05-07T20:33:19.9583943Z T: int, 2025-05-07T20:33:19.9584040Z D: int, 2025-05-07T20:33:19.9584155Z scale_ub: Optional[float], 2025-05-07T20:33:19.9584261Z contiguous: bool, 2025-05-07T20:33:19.9584373Z compiled: bool, 2025-05-07T20:33:19.9584478Z ) -> None: 2025-05-07T20:33:19.9584601Z torch.manual_seed(2025) 2025-05-07T20:33:19.9584712Z 2025-05-07T20:33:19.9584907Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9586920Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
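A note on the test harness before more of the same failures: the @settings decorator repeated in every example disables Hypothesis' per-example deadline and turns on verbose reporting, and the session header further down in this log shows the tests run under a 'ci' profile (database=None, derandomize=True, print_blob=True). A sketch of how such a profile is typically registered, with values copied from that header; where it is actually registered in this repo is an assumption, not shown in the log:

# Sketch: a Hypothesis 'ci' profile matching the session header in this log.
# The registration site (e.g. a conftest.py) is assumed.
from hypothesis import HealthCheck, settings

settings.register_profile(
    "ci",
    database=None,
    derandomize=True,
    deadline=None,
    print_blob=True,
    suppress_health_check=(HealthCheck.too_slow,),
)
settings.load_profile("ci")

With derandomize=True the generated examples are deterministic, which is why the same parameter combinations recur across retries.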
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9586930Z 2025-05-07T20:33:19.9587064Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9587071Z 2025-05-07T20:33:19.9587199Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9587451Z self=, 2025-05-07T20:33:19.9587542Z T=16384, 2025-05-07T20:33:19.9587639Z D=7168, 2025-05-07T20:33:19.9587735Z scale_ub=None, 2025-05-07T20:33:19.9587835Z contiguous=True, 2025-05-07T20:33:19.9587946Z compiled=False, 2025-05-07T20:33:19.9588032Z ) 2025-05-07T20:33:19.9588276Z self = 2025-05-07T20:33:19.9588484Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.9588489Z 2025-05-07T20:33:19.9588579Z @given( 2025-05-07T20:33:19.9588772Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9588887Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9589017Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9589160Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9589293Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9589379Z ) 2025-05-07T20:33:19.9589667Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9589775Z def test_silu_mul_quant( 2025-05-07T20:33:19.9589863Z self, 2025-05-07T20:33:19.9589960Z T: int, 2025-05-07T20:33:19.9590048Z D: int, 2025-05-07T20:33:19.9590161Z scale_ub: Optional[float], 2025-05-07T20:33:19.9590318Z contiguous: bool, 2025-05-07T20:33:19.9590417Z compiled: bool, 2025-05-07T20:33:19.9590511Z ) -> None: 2025-05-07T20:33:19.9590620Z torch.manual_seed(2025) 2025-05-07T20:33:19.9590707Z 2025-05-07T20:33:19.9590903Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9592938Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9592985Z 2025-05-07T20:33:19.9593128Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9593136Z 2025-05-07T20:33:19.9593256Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9593616Z self=, 2025-05-07T20:33:19.9593718Z T=16384, 2025-05-07T20:33:19.9593808Z D=7168, 2025-05-07T20:33:19.9593906Z scale_ub=1200.0, 2025-05-07T20:33:19.9594009Z contiguous=True, 2025-05-07T20:33:19.9594106Z compiled=False, 2025-05-07T20:33:19.9594199Z ) 2025-05-07T20:33:19.9594489Z self = 2025-05-07T20:33:19.9594696Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.9594701Z 2025-05-07T20:33:19.9594797Z @given( 2025-05-07T20:33:19.9594931Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9595045Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9595186Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9595319Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9595454Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9595545Z ) 2025-05-07T20:33:19.9595826Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9595939Z def test_silu_mul_quant( 2025-05-07T20:33:19.9596028Z self, 2025-05-07T20:33:19.9596116Z T: int, 2025-05-07T20:33:19.9596211Z D: int, 2025-05-07T20:33:19.9596322Z scale_ub: Optional[float], 2025-05-07T20:33:19.9596428Z contiguous: bool, 2025-05-07T20:33:19.9596534Z compiled: bool, 2025-05-07T20:33:19.9596624Z ) -> None: 2025-05-07T20:33:19.9596735Z torch.manual_seed(2025) 2025-05-07T20:33:19.9596826Z 2025-05-07T20:33:19.9597018Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9599075Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
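Every failure above is the same OOM signature: the test allocates a [T, 2 * D] bfloat16 tensor while the process already holds ~22.04 GiB of the device's 22.07 GiB. The "Tried to allocate" sizes in the messages follow directly from the tensor shapes; plain-Python arithmetic, no GPU needed:

# Sketch: derive the reported allocation sizes from the tensor shapes.
# bfloat16 is 2 bytes per element; the test allocates x of shape [T, 2 * D].
def bf16_alloc_mib(T: int, D: int) -> float:
    return T * (2 * D) * 2 / (1024 ** 2)

assert bf16_alloc_mib(4096, 7168) == 112.0   # matches "112.00 MiB" above
assert bf16_alloc_mib(16384, 7168) == 448.0  # matches "448.00 MiB" above

The individual allocations are modest; the problem is the ~21.7 GiB already allocated by PyTorch before each example starts.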
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9599088Z 2025-05-07T20:33:19.9599226Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9599231Z 2025-05-07T20:33:19.9599355Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9599607Z self=, 2025-05-07T20:33:19.9599697Z T=128, 2025-05-07T20:33:19.9599796Z D=5120, 2025-05-07T20:33:19.9599895Z scale_ub=1200.0, 2025-05-07T20:33:19.9599994Z contiguous=False, 2025-05-07T20:33:19.9600097Z compiled=False, 2025-05-07T20:33:19.9600231Z ) 2025-05-07T20:33:19.9600477Z self = 2025-05-07T20:33:19.9600680Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.9600688Z 2025-05-07T20:33:19.9600778Z @given( 2025-05-07T20:33:19.9600919Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9601033Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9601240Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9601426Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9601558Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9601649Z ) 2025-05-07T20:33:19.9601936Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9602044Z def test_silu_mul_quant( 2025-05-07T20:33:19.9602133Z self, 2025-05-07T20:33:19.9602228Z T: int, 2025-05-07T20:33:19.9602316Z D: int, 2025-05-07T20:33:19.9602431Z scale_ub: Optional[float], 2025-05-07T20:33:19.9602541Z contiguous: bool, 2025-05-07T20:33:19.9602640Z compiled: bool, 2025-05-07T20:33:19.9602736Z ) -> None: 2025-05-07T20:33:19.9602848Z torch.manual_seed(2025) 2025-05-07T20:33:19.9602933Z 2025-05-07T20:33:19.9603131Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9603220Z 2025-05-07T20:33:19.9603327Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9603482Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9603586Z x = x_sign * x_clamp 2025-05-07T20:33:19.9603679Z x0 = x[:, :D] 2025-05-07T20:33:19.9603780Z x1 = x[:, D:] 2025-05-07T20:33:19.9603864Z 2025-05-07T20:33:19.9603961Z if contiguous: 2025-05-07T20:33:19.9604075Z x0 = x0.contiguous() 2025-05-07T20:33:19.9604180Z x1 = x1.contiguous() 2025-05-07T20:33:19.9604265Z 2025-05-07T20:33:19.9604377Z if scale_ub is not None: 2025-05-07T20:33:19.9604503Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9604666Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9604755Z ) 2025-05-07T20:33:19.9604843Z else: 2025-05-07T20:33:19.9604959Z scale_ub_tensor = None 2025-05-07T20:33:19.9605042Z 2025-05-07T20:33:19.9605192Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9605306Z op = silu_mul_quant 2025-05-07T20:33:19.9605407Z if compiled: 2025-05-07T20:33:19.9605523Z op = torch.compile(op) 2025-05-07T20:33:19.9605652Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9605739Z 2025-05-07T20:33:19.9605843Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9605854Z 2025-05-07T20:33:19.9605968Z moe/activation_test.py:117: 2025-05-07T20:33:19.9606114Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9606243Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9606358Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9606984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9607106Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9607516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9607783Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9608171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9608280Z kernel = self.compile( 2025-05-07T20:33:19.9608723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9608926Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9609119Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9609124Z 2025-05-07T20:33:19.9609368Z self = 2025-05-07T20:33:19.9610286Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9610916Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbac5ea0>} 2025-05-07T20:33:19.9611759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9611987Z context = 2025-05-07T20:33:19.9611996Z 2025-05-07T20:33:19.9612183Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9612485Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9612618Z module_map=module_map) 2025-05-07T20:33:19.9612805Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9612924Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9613021Z E ^ 2025-05-07T20:33:19.9613423Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9613428Z 2025-05-07T20:33:19.9613901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9613906Z 2025-05-07T20:33:19.9614026Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9614279Z self=, 2025-05-07T20:33:19.9614375Z T=2048, 2025-05-07T20:33:19.9614465Z D=7168, 2025-05-07T20:33:19.9614563Z scale_ub=None, 2025-05-07T20:33:19.9614669Z contiguous=False, 2025-05-07T20:33:19.9614767Z compiled=False, 2025-05-07T20:33:19.9614857Z ) 2025-05-07T20:33:19.9615101Z self = 2025-05-07T20:33:19.9615309Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.9615314Z 2025-05-07T20:33:19.9615408Z @given( 2025-05-07T20:33:19.9615542Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9615656Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9615798Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9615932Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9616073Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9616161Z ) 2025-05-07T20:33:19.9616441Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9616556Z def test_silu_mul_quant( 2025-05-07T20:33:19.9616730Z self, 2025-05-07T20:33:19.9616820Z T: int, 2025-05-07T20:33:19.9616916Z D: int, 2025-05-07T20:33:19.9617030Z scale_ub: Optional[float], 2025-05-07T20:33:19.9617135Z contiguous: bool, 2025-05-07T20:33:19.9617246Z compiled: bool, 2025-05-07T20:33:19.9617339Z ) -> None: 2025-05-07T20:33:19.9617447Z torch.manual_seed(2025) 2025-05-07T20:33:19.9617541Z 2025-05-07T20:33:19.9617735Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9619744Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
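The error text itself names the standard mitigation. PYTORCH_CUDA_ALLOC_CONF has to be in the environment before CUDA is initialized, so in a workflow it belongs in the job's env rather than inside the test; a minimal sketch, assuming it is set from Python before the first CUDA call:

# Sketch: apply the allocator option suggested by the OOM messages above.
# The variable must be set before torch initializes CUDA; setting it before
# the torch import is the safe ordering.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

x = torch.randn([2048, 2 * 7168], device="cuda", dtype=torch.bfloat16)

Note this only reduces fragmentation of reserved-but-unallocated memory; it cannot reclaim memory that is genuinely still allocated.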
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9619796Z 2025-05-07T20:33:19.9619934Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9619986Z 2025-05-07T20:33:19.9620152Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9620408Z self=, 2025-05-07T20:33:19.9620500Z T=128, 2025-05-07T20:33:19.9620596Z D=7168, 2025-05-07T20:33:19.9620694Z scale_ub=1200.0, 2025-05-07T20:33:19.9620793Z contiguous=True, 2025-05-07T20:33:19.9620897Z compiled=True, 2025-05-07T20:33:19.9620983Z ) 2025-05-07T20:33:19.9621229Z self = 2025-05-07T20:33:19.9621435Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.9621440Z 2025-05-07T20:33:19.9621530Z @given( 2025-05-07T20:33:19.9621673Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9621790Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9621922Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9622064Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9622201Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9622288Z ) 2025-05-07T20:33:19.9622574Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9622682Z def test_silu_mul_quant( 2025-05-07T20:33:19.9622771Z self, 2025-05-07T20:33:19.9622865Z T: int, 2025-05-07T20:33:19.9622953Z D: int, 2025-05-07T20:33:19.9623068Z scale_ub: Optional[float], 2025-05-07T20:33:19.9623178Z contiguous: bool, 2025-05-07T20:33:19.9623278Z compiled: bool, 2025-05-07T20:33:19.9623373Z ) -> None: 2025-05-07T20:33:19.9623481Z torch.manual_seed(2025) 2025-05-07T20:33:19.9623564Z 2025-05-07T20:33:19.9624108Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9624244Z 2025-05-07T20:33:19.9624394Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9624547Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9624653Z x = x_sign * x_clamp 2025-05-07T20:33:19.9624749Z x0 = x[:, :D] 2025-05-07T20:33:19.9624846Z x1 = x[:, D:] 2025-05-07T20:33:19.9624930Z 2025-05-07T20:33:19.9625027Z if contiguous: 2025-05-07T20:33:19.9625142Z x0 = x0.contiguous() 2025-05-07T20:33:19.9625244Z x1 = x1.contiguous() 2025-05-07T20:33:19.9625334Z 2025-05-07T20:33:19.9625437Z if scale_ub is not None: 2025-05-07T20:33:19.9625559Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.9625723Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.9625810Z ) 2025-05-07T20:33:19.9625900Z else: 2025-05-07T20:33:19.9626014Z scale_ub_tensor = None 2025-05-07T20:33:19.9626288Z 2025-05-07T20:33:19.9626440Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.9626553Z op = silu_mul_quant 2025-05-07T20:33:19.9626652Z if compiled: 2025-05-07T20:33:19.9626771Z op = torch.compile(op) 2025-05-07T20:33:19.9626899Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9626985Z 2025-05-07T20:33:19.9627097Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.9627102Z 2025-05-07T20:33:19.9627213Z moe/activation_test.py:117: 2025-05-07T20:33:19.9627362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9627483Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.9627603Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.9628095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.9628210Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.9628774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.9628892Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.9629429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.9629686Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.9630079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.9630187Z kernel = self.compile( 2025-05-07T20:33:19.9630625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.9630836Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.9630981Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.9630989Z 2025-05-07T20:33:19.9631230Z self = 2025-05-07T20:33:19.9632114Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.9632692Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efacbac77f0>} 2025-05-07T20:33:19.9633621Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.9633847Z context = 2025-05-07T20:33:19.9633852Z 2025-05-07T20:33:19.9634050Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.9634352Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.9634483Z module_map=module_map) 2025-05-07T20:33:19.9634674Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.9634791Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.9634889Z E ^ 2025-05-07T20:33:19.9635292Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.9635297Z 2025-05-07T20:33:19.9635763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.9635768Z 2025-05-07T20:33:19.9635898Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9636149Z self=, 2025-05-07T20:33:19.9636244Z T=128, 2025-05-07T20:33:19.9636387Z D=7168, 2025-05-07T20:33:19.9636499Z scale_ub=1200.0, 2025-05-07T20:33:19.9636606Z contiguous=True, 2025-05-07T20:33:19.9636705Z compiled=False, 2025-05-07T20:33:19.9636789Z ) 2025-05-07T20:33:19.9637043Z self = 2025-05-07T20:33:19.9637244Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.9637249Z 2025-05-07T20:33:19.9637344Z @given( 2025-05-07T20:33:19.9637479Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9637597Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9637737Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9637872Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9638090Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9638183Z ) 2025-05-07T20:33:19.9638467Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9638579Z def test_silu_mul_quant( 2025-05-07T20:33:19.9638675Z self, 2025-05-07T20:33:19.9638763Z T: int, 2025-05-07T20:33:19.9638853Z D: int, 2025-05-07T20:33:19.9639023Z scale_ub: Optional[float], 2025-05-07T20:33:19.9639127Z contiguous: bool, 2025-05-07T20:33:19.9639277Z compiled: bool, 2025-05-07T20:33:19.9639371Z ) -> None: 2025-05-07T20:33:19.9639482Z torch.manual_seed(2025) 2025-05-07T20:33:19.9639576Z 2025-05-07T20:33:19.9639773Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9639861Z 2025-05-07T20:33:19.9639978Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9640123Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9642130Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
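By this point even a T=128 example fails on the 20 MiB temporary from torch.clamp(torch.abs(x), ...), with 21.77 GiB already allocated by PyTorch, so memory is evidently accumulating across Hypothesis examples rather than any single example being too large. A hedged sketch of per-example cleanup; whether this fully recovers the memory depends on what is keeping the earlier tensors alive:

# Sketch: release cached CUDA memory between property-based examples.
# Assumption: the growth comes from tensors still referenced between examples
# plus allocator caching, so collecting garbage and emptying the cache helps.
import gc
import torch

def reset_cuda_between_examples() -> None:
    torch.cuda.synchronize()   # let in-flight kernels finish
    gc.collect()               # drop unreachable tensors
    torch.cuda.empty_cache()   # return cached blocks to the driver

This could be wired in via a pytest fixture or run at the top of the test body.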
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9642139Z 2025-05-07T20:33:19.9642276Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:19.9642281Z 2025-05-07T20:33:19.9642399Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9642659Z self=, 2025-05-07T20:33:19.9642749Z T=128, 2025-05-07T20:33:19.9642845Z D=5120, 2025-05-07T20:33:19.9642943Z scale_ub=1200.0, 2025-05-07T20:33:19.9643045Z contiguous=True, 2025-05-07T20:33:19.9643147Z compiled=True, 2025-05-07T20:33:19.9643231Z ) 2025-05-07T20:33:19.9643477Z self = 2025-05-07T20:33:19.9643674Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.9643678Z 2025-05-07T20:33:19.9643766Z @given( 2025-05-07T20:33:19.9643900Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9644026Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9644156Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9644295Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9644425Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9644509Z ) 2025-05-07T20:33:19.9644792Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9644899Z def test_silu_mul_quant( 2025-05-07T20:33:19.9644991Z self, 2025-05-07T20:33:19.9645085Z T: int, 2025-05-07T20:33:19.9645174Z D: int, 2025-05-07T20:33:19.9645287Z scale_ub: Optional[float], 2025-05-07T20:33:19.9645450Z contiguous: bool, 2025-05-07T20:33:19.9645551Z compiled: bool, 2025-05-07T20:33:19.9645642Z ) -> None: 2025-05-07T20:33:19.9645756Z torch.manual_seed(2025) 2025-05-07T20:33:19.9645840Z 2025-05-07T20:33:19.9646041Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9646133Z 2025-05-07T20:33:19.9646240Z x_sign = torch.sign(x) 2025-05-07T20:33:19.9646389Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.9648379Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
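The run's other failure mode is deterministic rather than memory-dependent: Triton rejects the fp8e4nv (FP8 E4M3) dtype at kernel-compile time on this GPU, and as the retry below shows, the reference path (_kernel_quantize_fp8_row, reached through triton_quantize_fp8_row) fails identically to the kernel under test (_fbgemm_silu_mul_quant). This is expected on a pre-Hopper part: a g5 runner carries an A10G at compute capability 8.6 (an inference about this runner, not stated in the log), while Triton's fp8e4nv generally needs capability 8.9 or newer. A hedged sketch of gating the test on device capability:

# Sketch: skip FP8 tests where Triton lacks fp8e4nv support.
# Assumption: fp8e4nv requires compute capability >= 8.9; verify the cutoff
# against the Triton version actually in use.
import pytest
import torch

def supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@pytest.mark.skipif(
    not supports_fp8e4nv(),
    reason="Triton fp8e4nv not supported on this GPU architecture",
)
def test_silu_mul_quant() -> None:  # illustrative stand-in for the real test
    ...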
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9648431Z 2025-05-07T20:33:19.9648572Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:19.9648577Z 2025-05-07T20:33:19.9648740Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.9649031Z self=, 2025-05-07T20:33:19.9649128Z T=128, 2025-05-07T20:33:19.9649216Z D=7168, 2025-05-07T20:33:19.9649310Z scale_ub=None, 2025-05-07T20:33:19.9649415Z contiguous=True, 2025-05-07T20:33:19.9649511Z compiled=True, 2025-05-07T20:33:19.9649595Z ) 2025-05-07T20:33:19.9649844Z self = 2025-05-07T20:33:19.9650032Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:19.9650041Z 2025-05-07T20:33:19.9650135Z @given( 2025-05-07T20:33:19.9650268Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.9650383Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.9650520Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.9650657Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.9650791Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.9650881Z ) 2025-05-07T20:33:19.9651166Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.9651280Z def test_silu_mul_quant( 2025-05-07T20:33:19.9651369Z self, 2025-05-07T20:33:19.9651456Z T: int, 2025-05-07T20:33:19.9651553Z D: int, 2025-05-07T20:33:19.9651667Z scale_ub: Optional[float], 2025-05-07T20:33:19.9651769Z contiguous: bool, 2025-05-07T20:33:19.9651872Z compiled: bool, 2025-05-07T20:33:19.9651966Z ) -> None: 2025-05-07T20:33:19.9652075Z torch.manual_seed(2025) 2025-05-07T20:33:19.9652165Z 2025-05-07T20:33:19.9652358Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.9654370Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.9654380Z 2025-05-07T20:33:19.9654514Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.9654669Z =============================== warnings summary =============================== 2025-05-07T20:33:19.9655028Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:19.9655423Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:19.9655775Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:19.9656767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:19.9657036Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:19.9657041Z 2025-05-07T20:33:19.9657280Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:19.9657473Z ================= 1 failed, 1 deselected, 3 warnings in 18.35s ================= 2025-05-07T20:33:21.5929004Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:21.6581081Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:33:21.6581518Z 2025-05-07T20:33:23.6601063Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:33:25.8305104Z ============================= test session starts ============================== 2025-05-07T20:33:25.8306258Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:33:25.8307188Z cachedir: .pytest_cache 2025-05-07T20:33:25.8308148Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:33:25.8309469Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:33:25.8310216Z plugins: hypothesis-6.131.14 2025-05-07T20:33:27.4795049Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:33:27.6735760Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:33:27.6736256Z run-last-failure: rerun previous 1 failure 2025-05-07T20:33:27.6736504Z 2025-05-07T20:33:30.3771701Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.3772701Z self=, 2025-05-07T20:33:30.3773171Z T=1, 2025-05-07T20:33:30.3773385Z D=5120, 2025-05-07T20:33:30.3773688Z scale_ub=None, 2025-05-07T20:33:30.3773944Z contiguous=True, 2025-05-07T20:33:30.3774198Z compiled=True, 2025-05-07T20:33:30.3774463Z ) 2025-05-07T20:33:30.3774829Z self = 2025-05-07T20:33:30.3775381Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:30.3775681Z 2025-05-07T20:33:30.3775773Z @given( 2025-05-07T20:33:30.3776039Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.3776396Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.3776750Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.3777133Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.3777510Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.3777830Z ) 2025-05-07T20:33:30.3778232Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.3778733Z def test_silu_mul_quant( 2025-05-07T20:33:30.3779015Z self, 2025-05-07T20:33:30.3779237Z T: int, 2025-05-07T20:33:30.3779466Z D: int, 2025-05-07T20:33:30.3779724Z scale_ub: Optional[float], 2025-05-07T20:33:30.3780029Z contiguous: bool, 2025-05-07T20:33:30.3780305Z compiled: bool, 2025-05-07T20:33:30.3780565Z ) -> None: 2025-05-07T20:33:30.3781189Z torch.manual_seed(2025) 2025-05-07T20:33:30.3781471Z 2025-05-07T20:33:30.3781786Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.3782166Z 2025-05-07T20:33:30.3782396Z x_sign = torch.sign(x) 2025-05-07T20:33:30.3782734Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:33:30.3783083Z x = x_sign * x_clamp 2025-05-07T20:33:30.3783359Z x0 = x[:, :D] 2025-05-07T20:33:30.3783612Z x1 = x[:, D:] 2025-05-07T20:33:30.3783843Z 2025-05-07T20:33:30.3784056Z if contiguous: 2025-05-07T20:33:30.3784322Z x0 = x0.contiguous() 2025-05-07T20:33:30.3784613Z x1 = x1.contiguous() 2025-05-07T20:33:30.3784892Z 2025-05-07T20:33:30.3785115Z if scale_ub is not None: 2025-05-07T20:33:30.3785543Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:30.3785920Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:30.3786282Z ) 2025-05-07T20:33:30.3786510Z else: 2025-05-07T20:33:30.3786750Z scale_ub_tensor = None 2025-05-07T20:33:30.3787038Z 2025-05-07T20:33:30.3787305Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:30.3787753Z op = silu_mul_quant 2025-05-07T20:33:30.3788129Z if compiled: 2025-05-07T20:33:30.3788417Z op = torch.compile(op) 2025-05-07T20:33:30.3788746Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.3789064Z 2025-05-07T20:33:30.3789286Z y_fp8, y_scale = fn() 2025-05-07T20:33:30.3789606Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:30.3789941Z 2025-05-07T20:33:30.3790213Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:30.3790597Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:30.3790923Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:30.3791285Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:30.3791688Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:30.3792039Z 2025-05-07T20:33:30.3792270Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:30.3792493Z 2025-05-07T20:33:30.3792616Z moe/activation_test.py:126: 2025-05-07T20:33:30.3792952Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.3793331Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:30.3793865Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:30.3794759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:30.3795597Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:30.3796217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:30.3796992Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:30.3797761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:30.3798578Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:30.3799434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:30.3800278Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:30.3801090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:30.3801809Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:30.3802486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:30.3803070Z fn() 2025-05-07T20:33:30.3803705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:30.3804364Z self.fn.run( 
2025-05-07T20:33:30.3804894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:30.3805490Z kernel = self.compile( 2025-05-07T20:33:30.3806100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:30.3806842Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:30.3807291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.3807549Z 2025-05-07T20:33:30.3807783Z self = 2025-05-07T20:33:30.3809061Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:30.3810632Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae2b098af0>} 2025-05-07T20:33:30.3812268Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:30.3813418Z context = 2025-05-07T20:33:30.3813742Z 2025-05-07T20:33:30.3813933Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:30.3821767Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:30.3822473Z module_map=module_map) 2025-05-07T20:33:30.3822903Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:30.3823308Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:30.3823617Z E ^ 2025-05-07T20:33:30.3824415Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:30.3824926Z 2025-05-07T20:33:30.3825406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:30.3825988Z 2025-05-07T20:33:30.3826108Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.3826576Z self=, 2025-05-07T20:33:30.3827056Z T=2048, 2025-05-07T20:33:30.3827283Z D=5120, 2025-05-07T20:33:30.3827516Z scale_ub=1200.0, 2025-05-07T20:33:30.3827777Z contiguous=True, 2025-05-07T20:33:30.3828023Z compiled=False, 2025-05-07T20:33:30.3828262Z ) 2025-05-07T20:33:31.7998144Z self = 2025-05-07T20:33:31.7998838Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:31.7999130Z 2025-05-07T20:33:31.7999215Z @given( 2025-05-07T20:33:31.7999517Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.7999857Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.8000180Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.8000528Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.8000870Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.8001172Z ) 2025-05-07T20:33:31.8001544Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.8002003Z def test_silu_mul_quant( 2025-05-07T20:33:31.8002267Z self, 2025-05-07T20:33:31.8002477Z T: int, 2025-05-07T20:33:31.8002684Z D: int, 2025-05-07T20:33:31.8002916Z scale_ub: Optional[float], 2025-05-07T20:33:31.8003488Z contiguous: bool, 2025-05-07T20:33:31.8003745Z compiled: bool, 2025-05-07T20:33:31.8003988Z ) -> None: 2025-05-07T20:33:31.8004221Z torch.manual_seed(2025) 2025-05-07T20:33:31.8004474Z 2025-05-07T20:33:31.8004769Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.8005135Z 
2025-05-07T20:33:31.8005343Z x_sign = torch.sign(x) 2025-05-07T20:33:31.8005648Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.8005979Z x = x_sign * x_clamp 2025-05-07T20:33:31.8006237Z x0 = x[:, :D] 2025-05-07T20:33:31.8006463Z x1 = x[:, D:] 2025-05-07T20:33:31.8006684Z 2025-05-07T20:33:31.8006883Z if contiguous: 2025-05-07T20:33:31.8007124Z x0 = x0.contiguous() 2025-05-07T20:33:31.8007542Z x1 = x1.contiguous() 2025-05-07T20:33:31.8007800Z 2025-05-07T20:33:31.8008000Z if scale_ub is not None: 2025-05-07T20:33:31.8008297Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.8008654Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.8008975Z ) 2025-05-07T20:33:31.8009183Z else: 2025-05-07T20:33:31.8009408Z scale_ub_tensor = None 2025-05-07T20:33:31.8009748Z 2025-05-07T20:33:31.8010103Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.8010440Z op = silu_mul_quant 2025-05-07T20:33:31.8010709Z if compiled: 2025-05-07T20:33:31.8010970Z op = torch.compile(op) 2025-05-07T20:33:31.8011289Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.8011587Z 2025-05-07T20:33:31.8011787Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.8011968Z 2025-05-07T20:33:31.8012074Z moe/activation_test.py:117: 2025-05-07T20:33:31.8012394Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.8012740Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.8013042Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.8013774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.8014506Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.8015068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.8015784Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.8016482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.8017039Z kernel = self.compile( 2025-05-07T20:33:31.8017610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.8018308Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.8018726Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.8018968Z 2025-05-07T20:33:31.8019187Z self = 2025-05-07T20:33:31.8020323Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.8021789Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fae2af71990>} 2025-05-07T20:33:31.8023198Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.8024654Z context = 2025-05-07T20:33:31.8024958Z 2025-05-07T20:33:31.8025211Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.8025759Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.8026258Z module_map=module_map) 2025-05-07T20:33:31.8026640Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.8027031Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.8027339Z E ^ 2025-05-07T20:33:31.8027827Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.8028296Z 2025-05-07T20:33:31.8028729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.8029338Z 2025-05-07T20:33:31.8029447Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.8029886Z self=, 2025-05-07T20:33:31.8030305Z T=2048, 2025-05-07T20:33:31.8030507Z D=5120, 2025-05-07T20:33:31.8030713Z scale_ub=1200.0, 2025-05-07T20:33:31.8030950Z contiguous=True, 2025-05-07T20:33:31.8031177Z compiled=True, 2025-05-07T20:33:31.8031467Z ) 2025-05-07T20:33:31.8031911Z self = 2025-05-07T20:33:31.8032423Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:31.8032715Z 2025-05-07T20:33:31.8032797Z @given( 2025-05-07T20:33:31.8033043Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.8033365Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.8033752Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.8034105Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.8034446Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.8034748Z ) 2025-05-07T20:33:31.8035123Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.8035588Z def test_silu_mul_quant( 2025-05-07T20:33:31.8035838Z self, 2025-05-07T20:33:31.8036046Z T: int, 2025-05-07T20:33:31.8036263Z D: int, 2025-05-07T20:33:31.8036490Z scale_ub: Optional[float], 2025-05-07T20:33:31.8036780Z contiguous: bool, 2025-05-07T20:33:31.8037034Z compiled: bool, 2025-05-07T20:33:31.8037268Z ) -> None: 2025-05-07T20:33:31.8037499Z torch.manual_seed(2025) 2025-05-07T20:33:31.8037754Z 2025-05-07T20:33:31.8038036Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.8038392Z 2025-05-07T20:33:31.8038597Z x_sign = torch.sign(x) 2025-05-07T20:33:31.8038894Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.8039222Z x = x_sign * x_clamp 2025-05-07T20:33:31.8039476Z x0 = x[:, :D] 2025-05-07T20:33:31.8039700Z x1 = x[:, D:] 2025-05-07T20:33:31.8039923Z 2025-05-07T20:33:31.8040123Z if contiguous: 2025-05-07T20:33:31.8040362Z x0 = x0.contiguous() 2025-05-07T20:33:31.8040635Z x1 = x1.contiguous() 2025-05-07T20:33:31.8040888Z 2025-05-07T20:33:31.8041096Z if scale_ub is not None: 2025-05-07T20:33:31.8041383Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.8041736Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.8042059Z ) 2025-05-07T20:33:31.8042257Z else: 2025-05-07T20:33:31.8042480Z scale_ub_tensor = None 2025-05-07T20:33:31.8042747Z 2025-05-07T20:33:31.8042985Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.8043314Z op = silu_mul_quant 2025-05-07T20:33:31.8043579Z if compiled: 
2025-05-07T20:33:31.8043840Z op = torch.compile(op) 2025-05-07T20:33:31.8044152Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.8044441Z 2025-05-07T20:33:31.8044692Z y_fp8, y_scale = fn() 2025-05-07T20:33:31.8044997Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:31.8045305Z 2025-05-07T20:33:31.8045559Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.8045909Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:31.8046229Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:31.8046558Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:31.8046929Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.8047259Z 2025-05-07T20:33:31.8047476Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:31.8047680Z 2025-05-07T20:33:31.8047786Z moe/activation_test.py:126: 2025-05-07T20:33:31.8048105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.8048502Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:31.8048851Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.8049667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:31.8050449Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:31.8051101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.8051807Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.8052524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:31.8053280Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:31.8054064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:31.8054840Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:31.8055598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:31.8056261Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:31.8056892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:31.8057481Z fn() 2025-05-07T20:33:31.8058007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:31.8058613Z self.fn.run( 2025-05-07T20:33:31.8059096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.8059652Z kernel = self.compile( 2025-05-07T20:33:31.8060222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.8060907Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.8061316Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.8061564Z 2025-05-07T20:33:31.8061781Z self = 2025-05-07T20:33:31.8062905Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:33:31.8064328Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae29a196c0>} 2025-05-07T20:33:31.8065716Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.8066830Z context = 2025-05-07T20:33:31.8067138Z 2025-05-07T20:33:31.8067312Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.8067870Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.8068427Z module_map=module_map) 2025-05-07T20:33:31.8068813Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.8069191Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:31.8069492Z E ^ 2025-05-07T20:33:31.8069982Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.8070456Z 2025-05-07T20:33:31.8070937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.8071477Z 2025-05-07T20:33:31.8071594Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.8072032Z self=, 2025-05-07T20:33:31.8072457Z T=16384, 2025-05-07T20:33:31.8072662Z D=7168, 2025-05-07T20:33:31.8072913Z scale_ub=1200.0, 2025-05-07T20:33:31.8073155Z contiguous=False, 2025-05-07T20:33:31.8073429Z compiled=False, 2025-05-07T20:33:31.8073721Z ) 2025-05-07T20:33:33.0106554Z self = 2025-05-07T20:33:33.0107111Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:33.0107414Z 2025-05-07T20:33:33.0108088Z @given( 2025-05-07T20:33:33.0108481Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.0108933Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.0109419Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.0109777Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.0110132Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.0110442Z ) 2025-05-07T20:33:33.0110819Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.0111304Z def test_silu_mul_quant( 2025-05-07T20:33:33.0111576Z self, 2025-05-07T20:33:33.0111799Z T: int, 2025-05-07T20:33:33.0112014Z D: int, 2025-05-07T20:33:33.0112244Z scale_ub: Optional[float], 2025-05-07T20:33:33.0112544Z contiguous: bool, 2025-05-07T20:33:33.0112805Z compiled: bool, 2025-05-07T20:33:33.0113044Z ) -> None: 2025-05-07T20:33:33.0113283Z torch.manual_seed(2025) 2025-05-07T20:33:33.0113664Z 2025-05-07T20:33:33.0113955Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.0114341Z 2025-05-07T20:33:33.0114552Z x_sign = torch.sign(x) 2025-05-07T20:33:33.0114859Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.0115192Z x = x_sign * x_clamp 2025-05-07T20:33:33.0115456Z x0 = x[:, :D] 2025-05-07T20:33:33.0115691Z x1 = x[:, D:] 2025-05-07T20:33:33.0115920Z 2025-05-07T20:33:33.0116121Z if contiguous: 2025-05-07T20:33:33.0116362Z x0 = x0.contiguous() 2025-05-07T20:33:33.0116643Z x1 = x1.contiguous() 2025-05-07T20:33:33.0116915Z 2025-05-07T20:33:33.0117131Z if scale_ub is not None: 2025-05-07T20:33:33.0117424Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.0117841Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.0118177Z ) 2025-05-07T20:33:33.0118387Z else: 2025-05-07T20:33:33.0118618Z scale_ub_tensor = None 2025-05-07T20:33:33.0118890Z 2025-05-07T20:33:33.0119140Z def fn() -> Tuple[torch.Tensor, 
torch.Tensor]:
2025-05-07T20:33:33.0119478Z             op = silu_mul_quant
2025-05-07T20:33:33.0119778Z             if compiled:
2025-05-07T20:33:33.0120043Z                 op = torch.compile(op)
2025-05-07T20:33:33.0120697Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:33.0120988Z 
2025-05-07T20:33:33.0121240Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:33.0121484Z 
2025-05-07T20:33:33.0121628Z moe/activation_test.py:117: 
2025-05-07T20:33:33.0121947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:33.0122300Z moe/activation_test.py:115: in fn
2025-05-07T20:33:33.0122612Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:33.0123352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:33.0124619Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:33.0125206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:33.0126096Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:33.0126802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:33.0127407Z     kernel = self.compile(
2025-05-07T20:33:33.0128013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:33.0128898Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:33.0129320Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:33.0129570Z 
2025-05-07T20:33:33.0129791Z self = 
2025-05-07T20:33:33.0130936Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:33.0132611Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae29a18940>}
2025-05-07T20:33:33.0134046Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:33.0135138Z context = 
2025-05-07T20:33:33.0135448Z 
2025-05-07T20:33:33.0135626Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:33.0136184Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:33.0136677Z                            module_map=module_map)
2025-05-07T20:33:33.0137069Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:33.0137476Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:33.0137782Z E       ^
2025-05-07T20:33:33.0138273Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.0138752Z 
2025-05-07T20:33:33.0139190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:33.0139733Z 
2025-05-07T20:33:33.0139853Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:33.0140289Z     self=,
2025-05-07T20:33:33.0140713Z     T=1,
2025-05-07T20:33:33.0140916Z     D=7168,
2025-05-07T20:33:33.0141130Z     scale_ub=None,
2025-05-07T20:33:33.0141357Z     contiguous=True,
2025-05-07T20:33:33.0141600Z     compiled=True,
2025-05-07T20:33:33.0141830Z )
2025-05-07T20:33:33.0142167Z self = 
2025-05-07T20:33:33.0142684Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:33:33.0142956Z 
2025-05-07T20:33:33.0143048Z     @given(
2025-05-07T20:33:33.0143376Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:33.0143715Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:33.0144043Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:33.0144395Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:33.0144751Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:33.0145057Z     )
2025-05-07T20:33:33.0145439Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:33.0145904Z     def test_silu_mul_quant(
2025-05-07T20:33:33.0146164Z         self,
2025-05-07T20:33:33.0146378Z         T: int,
2025-05-07T20:33:33.0146585Z         D: int,
2025-05-07T20:33:33.0146824Z         scale_ub: Optional[float],
2025-05-07T20:33:33.0147115Z         contiguous: bool,
2025-05-07T20:33:33.0147418Z         compiled: bool,
2025-05-07T20:33:33.0147665Z     ) -> None:
2025-05-07T20:33:33.0147902Z         torch.manual_seed(2025)
2025-05-07T20:33:33.0148158Z 
2025-05-07T20:33:33.0148455Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:33.0148822Z 
2025-05-07T20:33:33.0149026Z         x_sign = torch.sign(x)
2025-05-07T20:33:33.0149338Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:33.0149757Z         x = x_sign * x_clamp
2025-05-07T20:33:33.0150012Z         x0 = x[:, :D]
2025-05-07T20:33:33.0150247Z         x1 = x[:, D:]
2025-05-07T20:33:33.0150475Z 
2025-05-07T20:33:33.0150678Z         if contiguous:
2025-05-07T20:33:33.0150925Z             x0 = x0.contiguous()
2025-05-07T20:33:33.0151203Z             x1 = x1.contiguous()
2025-05-07T20:33:33.0151461Z 
2025-05-07T20:33:33.0151662Z         if scale_ub is not None:
2025-05-07T20:33:33.0151960Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:33.0152323Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:33.0152652Z             )
2025-05-07T20:33:33.0152868Z         else:
2025-05-07T20:33:33.0153098Z             scale_ub_tensor = None
2025-05-07T20:33:33.0153366Z 
2025-05-07T20:33:33.0153700Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:33.0154038Z             op = silu_mul_quant
2025-05-07T20:33:33.0154306Z             if compiled:
2025-05-07T20:33:33.0154578Z                 op = torch.compile(op)
2025-05-07T20:33:33.0154898Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:33.0155193Z 
2025-05-07T20:33:33.0155405Z         y_fp8, y_scale = fn()
2025-05-07T20:33:33.0155712Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:33.0156023Z 
2025-05-07T20:33:33.0156279Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:33.0156636Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:33.0156952Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:33.0157286Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:33.0157670Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:33.0158006Z 
2025-05-07T20:33:33.0158220Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:33.0158432Z 
2025-05-07T20:33:33.0158539Z moe/activation_test.py:126:
2025-05-07T20:33:33.0158859Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:33.0159224Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:33.0159567Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:33.0160395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:33.0161188Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:33.0161759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:33.0162482Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:33.0163322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:33.0164087Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:33.0164881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:33:33.0165670Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:33.0166440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:33.0167113Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:33.0167791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:33.0168394Z     fn()
2025-05-07T20:33:33.0168934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:33.0169541Z     self.fn.run(
2025-05-07T20:33:33.0170037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:33.0170647Z     kernel = self.compile(
2025-05-07T20:33:33.0171263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:33.0171949Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:33.0172373Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:33.0172616Z 
2025-05-07T20:33:33.0172840Z self = 
2025-05-07T20:33:33.0173972Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:33.0175413Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae297b0790>}
2025-05-07T20:33:33.0176825Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:33.0177951Z context = 
2025-05-07T20:33:33.0178254Z 
2025-05-07T20:33:33.0178439Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:33.0178988Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:33.0179491Z                            module_map=module_map)
2025-05-07T20:33:33.0179882Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:33.0180267Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:33.0180548Z E       ^
2025-05-07T20:33:33.0181046Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.0181521Z 
2025-05-07T20:33:33.0181976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
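Note: the ValueError above is Triton refusing to lower the fp8e4nv (float8_e4m3fn) encoding on this GPU. The job runs on a linux.g5.4xlarge.nvidia.gpu runner, whose A10G reports compute capability 8.6, while Triton only supports fp8e4nv on SM 8.9 (Ada) and SM 9.0 (Hopper) and newer, which is why only 'fp8e4b15' and 'fp8e5' are offered. A minimal guard along these lines would skip the FP8 examples on unsupported hardware (a sketch, not the test suite's actual gating; supports_fp8e4nv is a hypothetical helper):

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (E4M3) needs an Ada (sm_89) or Hopper (sm_90) class GPU; the
    # A10G on this runner reports (8, 6), so the Triton kernels fail to compile.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 (fp8e4nv) requires SM 8.9+")
class ActivationFP8Tests(unittest.TestCase):
    ...

Applied at class or test level, this turns the repeated CompilationError below into an ordinary skip instead of a hard failure.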
2025-05-07T20:33:33.0182647Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> fails in fn() (moe/activation_test.py:117): CompilationError in _fbgemm_silu_mul_quant, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:34.6120761Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:33:34.6155296Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> fn() succeeds; same CompilationError in _kernel_quantize_fp8_row via ref_fn() (moe/activation_test.py:126)
2025-05-07T20:33:34.6872672Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:33:35.0728114Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:33:35.0770099Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError in _kernel_quantize_fp8_row via ref_fn()
2025-05-07T20:33:35.7075884Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError in _kernel_quantize_fp8_row via ref_fn()
2025-05-07T20:33:36.2946044Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError in _kernel_quantize_fp8_row via ref_fn()
2025-05-07T20:33:37.2666304Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError in _kernel_quantize_fp8_row via ref_fn()
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.0413823Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae28c252d0>} 2025-05-07T20:33:38.0415255Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.0416342Z context = 2025-05-07T20:33:38.0416650Z 2025-05-07T20:33:38.0416833Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.0417391Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.0417895Z module_map=module_map) 2025-05-07T20:33:38.0418333Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.0418715Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:38.0418998Z E ^ 2025-05-07T20:33:38.0419499Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.0419978Z 2025-05-07T20:33:38.0420429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.0421017Z 2025-05-07T20:33:38.0421170Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.0421617Z self=, 2025-05-07T20:33:38.0422045Z T=16384, 2025-05-07T20:33:38.0422259Z D=5120, 2025-05-07T20:33:38.0422463Z scale_ub=None, 2025-05-07T20:33:38.0422700Z contiguous=True, 2025-05-07T20:33:38.0422942Z compiled=True, 2025-05-07T20:33:38.0423157Z ) 2025-05-07T20:33:38.0830116Z W0507 20:33:38.081000 88487 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:33:38.0831720Z W0507 20:33:38.081000 88487 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:33:38.0833173Z W0507 20:33:38.081000 88487 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:33:38.0834329Z W0507 20:33:38.081000 88487 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:33:38.0835530Z W0507 20:33:38.081000 88487 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 
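The two warnings above are a secondary symptom: torch._dynamo stopped compiling silu_mul_quant after eight recompiles because every Hypothesis example perturbs the inputs (the last guard failure it reports is an x0 stride mismatch from a contiguous=False case), and it fell back to eager for the remaining examples. Below is a minimal sketch, assuming the harness is free to configure Dynamo, of two ways to stay under the limit; neither line appears in the FBGEMM test itself.

import torch

# Assumption: the harness may simply allow more recompiles per function
# (the default limit hit above is 8).
torch._dynamo.config.recompile_limit = 64

# Assumption: marking dim 0 (the T axis) dynamic lets one compiled graph
# serve every T that Hypothesis draws. It does not absorb the
# contiguous/strided layout flip, which still fails a stride guard.
def compiled_call(op, x0, x1, scale_ub_tensor):
    torch._dynamo.mark_dynamic(x0, 0)
    torch._dynamo.mark_dynamic(x1, 0)
    return torch.compile(op)(x0, x1, scale_ub_tensor)

Rerunning with TORCH_LOGS="recompiles", as the warning suggests, would print every recompilation reason rather than only the last one.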
2025-05-07T20:33:38.1923329Z self = 2025-05-07T20:33:38.1925110Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:38.1925852Z 2025-05-07T20:33:38.1926072Z @given( 2025-05-07T20:33:38.1926676Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.1927404Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.1927997Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.1928641Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.1929270Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.1929669Z ) 2025-05-07T20:33:38.1930089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.1930577Z def test_silu_mul_quant( 2025-05-07T20:33:38.1930847Z self, 2025-05-07T20:33:38.1931067Z T: int, 2025-05-07T20:33:38.1931281Z D: int, 2025-05-07T20:33:38.1931528Z scale_ub: Optional[float], 2025-05-07T20:33:38.1931835Z contiguous: bool, 2025-05-07T20:33:38.1932097Z compiled: bool, 2025-05-07T20:33:38.1932345Z ) -> None: 2025-05-07T20:33:38.1932583Z torch.manual_seed(2025) 2025-05-07T20:33:38.1932973Z 2025-05-07T20:33:38.1933282Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.1933661Z 2025-05-07T20:33:38.1933875Z x_sign = torch.sign(x) 2025-05-07T20:33:38.1934197Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.1934542Z x = x_sign * x_clamp 2025-05-07T20:33:38.1934812Z x0 = x[:, :D] 2025-05-07T20:33:38.1935048Z x1 = x[:, D:] 2025-05-07T20:33:38.1935278Z 2025-05-07T20:33:38.1935487Z if contiguous: 2025-05-07T20:33:38.1935742Z x0 = x0.contiguous() 2025-05-07T20:33:38.1936032Z x1 = x1.contiguous() 2025-05-07T20:33:38.1936305Z 2025-05-07T20:33:38.1936516Z if scale_ub is not None: 2025-05-07T20:33:38.1936825Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.1937274Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.1937611Z ) 2025-05-07T20:33:38.1937828Z else: 2025-05-07T20:33:38.1938067Z scale_ub_tensor = None 2025-05-07T20:33:38.1938343Z 2025-05-07T20:33:38.1938606Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.1938954Z op = silu_mul_quant 2025-05-07T20:33:38.1939308Z if compiled: 2025-05-07T20:33:38.1939641Z op = torch.compile(op) 2025-05-07T20:33:38.1939977Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.1940284Z 2025-05-07T20:33:38.1940492Z y_fp8, y_scale = fn() 2025-05-07T20:33:38.1940818Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:38.1941140Z 2025-05-07T20:33:38.1941401Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.1941773Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:38.1942105Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:38.1942449Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:38.1942847Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:38.1943196Z 2025-05-07T20:33:38.1943419Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:38.1943639Z 2025-05-07T20:33:38.1943754Z moe/activation_test.py:126: 2025-05-07T20:33:38.1944090Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.1944464Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:38.1944825Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:38.1945701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:38.1946534Z 
_kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:38.1947149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.1947905Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.1948671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:38.1949529Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:38.1950365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:38.1951198Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:38.1952017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:38.1952733Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:38.1953398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:38.1954085Z fn() 2025-05-07T20:33:38.1954705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:38.1955352Z self.fn.run( 2025-05-07T20:33:38.1955865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.1956463Z kernel = self.compile( 2025-05-07T20:33:38.1957067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.1957789Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.1958230Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.1958488Z 2025-05-07T20:33:38.1958723Z self = 2025-05-07T20:33:38.1959933Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.1961527Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae28c360e0>} 2025-05-07T20:33:38.1963102Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.1964240Z context = 2025-05-07T20:33:38.1964559Z 2025-05-07T20:33:38.1964750Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.1965327Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.1965847Z module_map=module_map) 2025-05-07T20:33:38.1966255Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.1966653Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:38.1966948Z E ^ 2025-05-07T20:33:38.1967463Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.1967968Z 2025-05-07T20:33:38.1968436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.1969001Z 2025-05-07T20:33:38.1969126Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.1969579Z self=, 2025-05-07T20:33:38.1970025Z T=1, 2025-05-07T20:33:38.1970231Z D=5120, 2025-05-07T20:33:38.1970445Z scale_ub=1200.0, 2025-05-07T20:33:38.1970697Z contiguous=True, 2025-05-07T20:33:38.1970949Z compiled=True, 2025-05-07T20:33:38.1971173Z ) 2025-05-07T20:33:38.3493384Z self = 2025-05-07T20:33:38.3494933Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:38.3495704Z 2025-05-07T20:33:38.3495945Z @given( 2025-05-07T20:33:38.3496540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.3497221Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.3497874Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.3498583Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.3499098Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.3499400Z ) 2025-05-07T20:33:38.3499780Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.3500257Z def test_silu_mul_quant( 2025-05-07T20:33:38.3500519Z self, 2025-05-07T20:33:38.3500723Z T: int, 2025-05-07T20:33:38.3500944Z D: int, 2025-05-07T20:33:38.3501183Z scale_ub: Optional[float], 2025-05-07T20:33:38.3501469Z contiguous: bool, 2025-05-07T20:33:38.3501739Z compiled: bool, 2025-05-07T20:33:38.3502111Z ) -> None: 2025-05-07T20:33:38.3502346Z torch.manual_seed(2025) 2025-05-07T20:33:38.3502608Z 2025-05-07T20:33:38.3502902Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.3503267Z 2025-05-07T20:33:38.3503481Z x_sign = torch.sign(x) 2025-05-07T20:33:38.3503796Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.3504133Z x = x_sign * x_clamp 2025-05-07T20:33:38.3504400Z x0 = x[:, :D] 2025-05-07T20:33:38.3504644Z x1 = x[:, D:] 2025-05-07T20:33:38.3504865Z 2025-05-07T20:33:38.3505072Z if contiguous: 2025-05-07T20:33:38.3505323Z x0 = x0.contiguous() 2025-05-07T20:33:38.3505604Z x1 = x1.contiguous() 2025-05-07T20:33:38.3505934Z 2025-05-07T20:33:38.3506144Z if scale_ub is not None: 2025-05-07T20:33:38.3506444Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.3506803Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.3507137Z ) 2025-05-07T20:33:38.3507348Z else: 2025-05-07T20:33:38.3507571Z scale_ub_tensor = None 2025-05-07T20:33:38.3507842Z 2025-05-07T20:33:38.3508165Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.3508556Z op = silu_mul_quant 2025-05-07T20:33:38.3508830Z if compiled: 2025-05-07T20:33:38.3509101Z op = torch.compile(op) 2025-05-07T20:33:38.3509415Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.3509761Z 2025-05-07T20:33:38.3509973Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.3510151Z 2025-05-07T20:33:38.3510258Z moe/activation_test.py:117: 2025-05-07T20:33:38.3510578Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.3510937Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.3511242Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.3511840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:38.3512441Z return fn(*args, **kwargs) 
2025-05-07T20:33:38.3513147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:38.3513967Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.3514540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:33:38.3515268Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.3515974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.3516535Z kernel = self.compile( 2025-05-07T20:33:38.3517119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.3517826Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.3518255Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.3518514Z 2025-05-07T20:33:38.3518792Z self = 2025-05-07T20:33:38.3520241Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.3521786Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae28222560>} 2025-05-07T20:33:38.3523218Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.3524648Z context = 2025-05-07T20:33:38.3524966Z 2025-05-07T20:33:38.3525144Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.3525707Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.3526210Z module_map=module_map) 2025-05-07T20:33:38.3526597Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.3526976Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.3527255Z E ^ 2025-05-07T20:33:38.3527745Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.3528230Z 2025-05-07T20:33:38.3528739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.3529291Z
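This CompilationError is the primary failure and repeats identically for every example: on this runner's GPU, Triton's CUDA backend can only lower the 'fp8e4b15' and 'fp8e5' formats, so a kernel that materializes fp8e4nv (PyTorch's float8_e4m3fn) fails in ast_to_ttir before any values are computed. A sketch of a capability gate that would skip such cases up front follows; the (8, 9) compute-capability cutoff is an assumption about where Triton enables fp8e4nv (Ada/Hopper-class parts such as L4 or H100, whereas the A10G in a g5.4xlarge reports (8, 6)), and the decorator name is hypothetical.

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # Assumed cutoff: SM 8.9 and newer expose fp8e4nv in Triton.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical guard for test_silu_mul_quant and the other fp8 rowwise tests.
skip_unless_fp8e4nv = unittest.skipUnless(
    supports_fp8e4nv(), "Triton fp8e4nv unsupported on this architecture"
)

Applied as a decorator, the gate turns each would-be CompilationError into a skip, which keeps genuine numeric regressions from being buried under architecture noise.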
2025-05-07T20:33:38.3529408Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.3529852Z self=, 2025-05-07T20:33:38.3530278Z T=1, 2025-05-07T20:33:38.3530477Z D=5120, 2025-05-07T20:33:38.3530758Z scale_ub=None, 2025-05-07T20:33:38.3530989Z contiguous=False, 2025-05-07T20:33:38.3531289Z compiled=True, 2025-05-07T20:33:38.3531512Z ) 2025-05-07T20:33:38.4211268Z self = 2025-05-07T20:33:38.4212045Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:38.4212429Z 2025-05-07T20:33:38.4212545Z @given( 2025-05-07T20:33:38.4212860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.4213194Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.4213522Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.4213875Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.4214230Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.4214534Z ) 2025-05-07T20:33:38.4214908Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.4215373Z def test_silu_mul_quant( 2025-05-07T20:33:38.4215629Z self, 2025-05-07T20:33:38.4215837Z T: int, 2025-05-07T20:33:38.4216054Z D: int, 2025-05-07T20:33:38.4216292Z scale_ub: Optional[float], 2025-05-07T20:33:38.4216575Z contiguous: bool, 2025-05-07T20:33:38.4216829Z compiled: bool, 2025-05-07T20:33:38.4217067Z ) -> None: 2025-05-07T20:33:38.4217295Z torch.manual_seed(2025) 2025-05-07T20:33:38.4217552Z 2025-05-07T20:33:38.4217841Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.4218198Z 2025-05-07T20:33:38.4218406Z x_sign = torch.sign(x) 2025-05-07T20:33:38.4218714Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.4219039Z x = x_sign * x_clamp 2025-05-07T20:33:38.4219298Z x0 = x[:, :D] 2025-05-07T20:33:38.4219533Z x1 = x[:, D:] 2025-05-07T20:33:38.4219750Z 2025-05-07T20:33:38.4219950Z if contiguous: 2025-05-07T20:33:38.4220199Z x0 = x0.contiguous() 2025-05-07T20:33:38.4220473Z x1 = x1.contiguous() 2025-05-07T20:33:38.4220733Z 2025-05-07T20:33:38.4220943Z if scale_ub is not None: 2025-05-07T20:33:38.4221232Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.4221593Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.4221922Z ) 2025-05-07T20:33:38.4222128Z else: 2025-05-07T20:33:38.4222346Z scale_ub_tensor = None 2025-05-07T20:33:38.4222613Z 2025-05-07T20:33:38.4222861Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.4223193Z op = silu_mul_quant 2025-05-07T20:33:38.4223460Z if compiled: 2025-05-07T20:33:38.4223723Z op = torch.compile(op) 2025-05-07T20:33:38.4224351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.4224647Z 2025-05-07T20:33:38.4224853Z y_fp8, y_scale = fn() 2025-05-07T20:33:38.4225152Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:38.4225463Z 2025-05-07T20:33:38.4225719Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.4226069Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:38.4226385Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:38.4226718Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:38.4227098Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:38.4227430Z 2025-05-07T20:33:38.4227660Z > y_fp8_ref,
y_scale_ref = ref_fn() 2025-05-07T20:33:38.4227934Z 2025-05-07T20:33:38.4228045Z moe/activation_test.py:126: 2025-05-07T20:33:38.4228357Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.4228713Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:38.4229060Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:38.4229888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:38.4230812Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:38.4231389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.4232109Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.4232828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:38.4233662Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:38.4234464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:38.4235250Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:38.4236008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:38.4236686Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:38.4237326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:38.4237869Z fn() 2025-05-07T20:33:38.4238404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:38.4239016Z self.fn.run( 2025-05-07T20:33:38.4239508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.4240066Z kernel = self.compile( 2025-05-07T20:33:38.4240644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.4241332Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.4241756Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.4242001Z 2025-05-07T20:33:38.4242224Z self = 2025-05-07T20:33:38.4243358Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.4244801Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fae28d47910>} 2025-05-07T20:33:38.4251716Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.4252810Z context = 2025-05-07T20:33:38.4253121Z 2025-05-07T20:33:38.4253300Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.4253847Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.4254343Z module_map=module_map) 2025-05-07T20:33:38.4254728Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.4255102Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:38.4255384Z E ^ 2025-05-07T20:33:38.4255880Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.4256402Z 2025-05-07T20:33:38.4256844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.4257386Z 2025-05-07T20:33:38.4257498Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.4257933Z self=, 2025-05-07T20:33:38.4258396Z T=1, 2025-05-07T20:33:38.4258587Z D=5120, 2025-05-07T20:33:38.4258836Z scale_ub=None, 2025-05-07T20:33:38.4259067Z contiguous=True, 2025-05-07T20:33:38.4259300Z compiled=False, 2025-05-07T20:33:38.4259515Z ) 2025-05-07T20:33:38.7503845Z self = 2025-05-07T20:33:38.7504634Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:38.7505018Z 2025-05-07T20:33:38.7505135Z @given( 2025-05-07T20:33:38.7505479Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.7505949Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.7506394Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.7506817Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.7507164Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.7507468Z ) 2025-05-07T20:33:38.7507833Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.7508303Z def test_silu_mul_quant( 2025-05-07T20:33:38.7508560Z self, 2025-05-07T20:33:38.7508761Z T: int, 2025-05-07T20:33:38.7508969Z D: int, 2025-05-07T20:33:38.7509197Z scale_ub: Optional[float], 2025-05-07T20:33:38.7509479Z contiguous: bool, 2025-05-07T20:33:38.7509730Z compiled: bool, 2025-05-07T20:33:38.7509970Z ) -> None: 2025-05-07T20:33:38.7510197Z torch.manual_seed(2025) 2025-05-07T20:33:38.7510453Z 2025-05-07T20:33:38.7510745Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.7511107Z 2025-05-07T20:33:38.7511311Z x_sign = torch.sign(x) 2025-05-07T20:33:38.7511622Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.7511950Z x = x_sign * x_clamp 2025-05-07T20:33:38.7512209Z x0 = x[:, :D] 2025-05-07T20:33:38.7512438Z x1 = x[:, D:] 2025-05-07T20:33:38.7512661Z 2025-05-07T20:33:38.7512855Z if contiguous: 2025-05-07T20:33:38.7513105Z x0 = x0.contiguous() 2025-05-07T20:33:38.7513383Z x1 = x1.contiguous() 2025-05-07T20:33:38.7513697Z 2025-05-07T20:33:38.7513903Z if scale_ub is not None: 2025-05-07T20:33:38.7514195Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.7514544Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.7514869Z ) 2025-05-07T20:33:38.7515074Z else: 2025-05-07T20:33:38.7515295Z scale_ub_tensor = None 2025-05-07T20:33:38.7515564Z 2025-05-07T20:33:38.7515808Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.7516135Z op = silu_mul_quant 2025-05-07T20:33:38.7516520Z if compiled: 2025-05-07T20:33:38.7516789Z 
op = torch.compile(op) 2025-05-07T20:33:38.7517100Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.7517387Z 2025-05-07T20:33:38.7517596Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.7517768Z 2025-05-07T20:33:38.7517882Z moe/activation_test.py:117: 2025-05-07T20:33:38.7518188Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.7518535Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.7518831Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.7519557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:38.7520304Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.7520934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.7521654Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.7522348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.7522964Z kernel = self.compile( 2025-05-07T20:33:38.7523592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.7524481Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.7524901Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.7525141Z 2025-05-07T20:33:38.7525357Z self = 2025-05-07T20:33:38.7526485Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.7527926Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae28d47d00>} 2025-05-07T20:33:38.7529330Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.7530391Z context = 2025-05-07T20:33:38.7530697Z 2025-05-07T20:33:38.7530871Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.7531414Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.7531912Z module_map=module_map) 2025-05-07T20:33:38.7532290Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.7532658Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.7532937Z E ^ 2025-05-07T20:33:38.7533416Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.7533889Z 2025-05-07T20:33:38.7534330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.7534864Z 2025-05-07T20:33:38.7534973Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.7535404Z self=, 2025-05-07T20:33:38.7535817Z T=128, 2025-05-07T20:33:38.7536013Z D=5120, 2025-05-07T20:33:38.7536220Z scale_ub=None, 2025-05-07T20:33:38.7536443Z contiguous=False, 2025-05-07T20:33:38.7536685Z compiled=True, 2025-05-07T20:33:38.7536907Z ) 2025-05-07T20:33:38.7537235Z self = 2025-05-07T20:33:38.7537828Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:38.7538115Z 2025-05-07T20:33:38.7538197Z @given( 2025-05-07T20:33:38.7538442Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.7538767Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.7539089Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.7539440Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.7539787Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.7540090Z ) 2025-05-07T20:33:38.7540461Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.7540918Z def test_silu_mul_quant( 2025-05-07T20:33:38.7541172Z self, 2025-05-07T20:33:38.7541377Z T: int, 2025-05-07T20:33:38.7541585Z D: int, 2025-05-07T20:33:38.7541877Z scale_ub: Optional[float], 2025-05-07T20:33:38.7542164Z contiguous: bool, 2025-05-07T20:33:38.7542416Z compiled: bool, 2025-05-07T20:33:38.7542650Z ) -> None: 2025-05-07T20:33:38.7542878Z torch.manual_seed(2025) 2025-05-07T20:33:38.7543134Z 2025-05-07T20:33:38.7543414Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.7543841Z 2025-05-07T20:33:38.7544047Z x_sign = torch.sign(x) 2025-05-07T20:33:38.7544406Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.7544736Z x = x_sign * x_clamp 2025-05-07T20:33:38.7544990Z x0 = x[:, :D] 2025-05-07T20:33:38.7545214Z x1 = x[:, D:] 2025-05-07T20:33:38.7545432Z 2025-05-07T20:33:38.7545630Z if contiguous: 2025-05-07T20:33:38.7545870Z x0 = x0.contiguous() 2025-05-07T20:33:38.7546140Z x1 = x1.contiguous() 2025-05-07T20:33:38.7546395Z 2025-05-07T20:33:38.7546601Z if scale_ub is not None: 2025-05-07T20:33:38.7546885Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.7547238Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.7547566Z ) 2025-05-07T20:33:38.7547767Z else: 2025-05-07T20:33:38.7547988Z scale_ub_tensor = None 2025-05-07T20:33:38.7548254Z 2025-05-07T20:33:38.7548497Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.7548833Z op = silu_mul_quant 2025-05-07T20:33:38.7549099Z if compiled: 2025-05-07T20:33:38.7549355Z op = torch.compile(op) 2025-05-07T20:33:38.7549665Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.7549954Z 2025-05-07T20:33:38.7550153Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.7550333Z 2025-05-07T20:33:38.7550438Z moe/activation_test.py:117: 2025-05-07T20:33:38.7550750Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.7551094Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.7551394Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.7551977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:38.7552561Z return fn(*args, **kwargs) 
2025-05-07T20:33:38.7553243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:38.7554015Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.7554574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.7555286Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.7555973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.7556528Z kernel = self.compile( 2025-05-07T20:33:38.7557099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.7557827Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.7558244Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.7558491Z 2025-05-07T20:33:38.7558705Z self = 2025-05-07T20:33:38.7559828Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.7561253Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae28d44940>} 2025-05-07T20:33:38.7562652Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.7563822Z context = 2025-05-07T20:33:38.7564122Z 2025-05-07T20:33:38.7564301Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.7564849Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.7565439Z module_map=module_map) 2025-05-07T20:33:38.7565823Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.7566192Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.7566464Z E ^ 2025-05-07T20:33:38.7566952Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.7567424Z 2025-05-07T20:33:38.7567865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.7568401Z 2025-05-07T20:33:38.7568515Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.7568947Z self=, 2025-05-07T20:33:38.7569368Z T=128, 2025-05-07T20:33:38.7569569Z D=7168, 2025-05-07T20:33:38.7569767Z scale_ub=1200.0, 2025-05-07T20:33:38.7570009Z contiguous=False, 2025-05-07T20:33:38.7570250Z compiled=False, 2025-05-07T20:33:38.7570461Z ) 2025-05-07T20:33:38.8840213Z self = 2025-05-07T20:33:38.8840935Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:38.8841375Z 2025-05-07T20:33:38.8841507Z @given( 2025-05-07T20:33:38.8841860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.8842332Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.8842784Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.8843242Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.8843589Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.8843890Z ) 2025-05-07T20:33:38.8844254Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.8844725Z def test_silu_mul_quant( 2025-05-07T20:33:38.8844986Z self, 2025-05-07T20:33:38.8845193Z T: int, 2025-05-07T20:33:38.8845403Z D: int, 2025-05-07T20:33:38.8845630Z scale_ub: Optional[float], 2025-05-07T20:33:38.8845919Z contiguous: bool, 2025-05-07T20:33:38.8846170Z compiled: bool, 2025-05-07T20:33:38.8846405Z ) -> None: 2025-05-07T20:33:38.8846635Z torch.manual_seed(2025) 2025-05-07T20:33:38.8846894Z 2025-05-07T20:33:38.8847183Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.8847536Z 2025-05-07T20:33:38.8847747Z x_sign = torch.sign(x) 2025-05-07T20:33:38.8848055Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.8848375Z x = x_sign * x_clamp 2025-05-07T20:33:38.8848751Z x0 = x[:, :D] 2025-05-07T20:33:38.8848984Z x1 = x[:, D:] 2025-05-07T20:33:38.8849201Z 2025-05-07T20:33:38.8849401Z if contiguous: 2025-05-07T20:33:38.8849645Z x0 = x0.contiguous() 2025-05-07T20:33:38.8849914Z x1 = x1.contiguous() 2025-05-07T20:33:38.8850168Z 2025-05-07T20:33:38.8850373Z if scale_ub is not None: 2025-05-07T20:33:38.8850656Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.8851009Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.8851338Z ) 2025-05-07T20:33:38.8851544Z else: 2025-05-07T20:33:38.8851761Z scale_ub_tensor = None 2025-05-07T20:33:38.8852031Z 2025-05-07T20:33:38.8852275Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.8852670Z op = silu_mul_quant 2025-05-07T20:33:38.8852933Z if compiled: 2025-05-07T20:33:38.8853206Z op = torch.compile(op) 2025-05-07T20:33:38.8853517Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.8853805Z 2025-05-07T20:33:38.8854012Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.8854187Z 2025-05-07T20:33:38.8854292Z moe/activation_test.py:117: 2025-05-07T20:33:38.8854727Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.8855102Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.8855396Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.8856114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:38.8856835Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.8857394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.8858111Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.8858799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.8859358Z kernel = self.compile( 2025-05-07T20:33:38.8859922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.8860613Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.8861023Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.8861267Z 2025-05-07T20:33:38.8861483Z self = 2025-05-07T20:33:38.8862604Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.8864045Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae289abf40>} 2025-05-07T20:33:38.8865441Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.8866516Z context = 2025-05-07T20:33:38.8866822Z 2025-05-07T20:33:38.8866995Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.8867539Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.8868024Z module_map=module_map) 2025-05-07T20:33:38.8868406Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.8868776Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.8869045Z E ^ 2025-05-07T20:33:38.8869579Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.8870056Z 2025-05-07T20:33:38.8870490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.8871027Z 2025-05-07T20:33:38.8871145Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.8871575Z self=, 2025-05-07T20:33:38.8871992Z T=128, 2025-05-07T20:33:38.8872189Z D=5120, 2025-05-07T20:33:38.8872394Z scale_ub=None, 2025-05-07T20:33:38.8872621Z contiguous=False, 2025-05-07T20:33:38.8872862Z compiled=False, 2025-05-07T20:33:38.8873081Z ) 2025-05-07T20:33:38.8873413Z self = 2025-05-07T20:33:38.8874065Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:38.8874345Z 2025-05-07T20:33:38.8874431Z @given( 2025-05-07T20:33:38.8874674Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.8875004Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.8875325Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.8875667Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.8876103Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.8876404Z ) 2025-05-07T20:33:38.8876774Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.8877230Z def test_silu_mul_quant( 2025-05-07T20:33:38.8877481Z self, 2025-05-07T20:33:38.8877687Z T: int, 2025-05-07T20:33:38.8877889Z D: int, 2025-05-07T20:33:38.8878119Z scale_ub: Optional[float], 2025-05-07T20:33:38.8878406Z contiguous: bool, 2025-05-07T20:33:38.8878660Z compiled: bool, 2025-05-07T20:33:38.8878895Z ) -> None: 2025-05-07T20:33:38.8879123Z torch.manual_seed(2025) 2025-05-07T20:33:38.8879375Z 2025-05-07T20:33:38.8879666Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.8880021Z 2025-05-07T20:33:38.8880227Z x_sign = torch.sign(x) 2025-05-07T20:33:38.8880532Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.8880860Z x = x_sign * x_clamp 2025-05-07T20:33:38.8881112Z x0 = x[:, :D] 2025-05-07T20:33:38.8881337Z x1 = x[:, D:] 2025-05-07T20:33:38.8881557Z 2025-05-07T20:33:38.8881749Z if contiguous: 2025-05-07T20:33:38.8881995Z x0 = x0.contiguous() 2025-05-07T20:33:38.8882270Z x1 = x1.contiguous() 2025-05-07T20:33:38.8882524Z 2025-05-07T20:33:38.8882727Z if scale_ub is not None: 2025-05-07T20:33:38.8883016Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.8883369Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.8883688Z ) 2025-05-07T20:33:38.8883890Z else: 2025-05-07T20:33:38.8884113Z scale_ub_tensor = None 2025-05-07T20:33:38.8884369Z 2025-05-07T20:33:38.8884613Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.8884942Z op = silu_mul_quant 2025-05-07T20:33:38.8885202Z if compiled: 2025-05-07T20:33:38.8885468Z op = torch.compile(op) 2025-05-07T20:33:38.8885782Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.8886064Z 2025-05-07T20:33:38.8886271Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.8886450Z 2025-05-07T20:33:38.8886554Z moe/activation_test.py:117: 2025-05-07T20:33:38.8886863Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.8887204Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.8887499Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.8888222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:38.8888996Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.8889560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.8890274Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.8890971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.8891524Z kernel = self.compile( 2025-05-07T20:33:38.8892091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.8892775Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.8893185Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.8893473Z 2025-05-07T20:33:38.8893689Z self = 2025-05-07T20:33:38.8894811Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.8896319Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae289a95a0>} 2025-05-07T20:33:38.8897719Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.8898781Z context = 2025-05-07T20:33:38.8899087Z 2025-05-07T20:33:38.8899262Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.8899826Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.8900354Z module_map=module_map) 2025-05-07T20:33:38.8900732Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.8901098Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.8901371Z E ^ 2025-05-07T20:33:38.8901858Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.8902331Z 2025-05-07T20:33:38.8902763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.8903302Z 2025-05-07T20:33:38.8903410Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.8903842Z self=, 2025-05-07T20:33:38.8904262Z T=128, 2025-05-07T20:33:38.8904459Z D=5120, 2025-05-07T20:33:38.8904662Z scale_ub=1200.0, 2025-05-07T20:33:38.8904891Z contiguous=True, 2025-05-07T20:33:38.8905122Z compiled=False, 2025-05-07T20:33:38.8905337Z ) 2025-05-07T20:33:39.0843086Z self = 2025-05-07T20:33:39.0843613Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:39.0844060Z 2025-05-07T20:33:39.0844195Z @given( 2025-05-07T20:33:39.0844552Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:39.0844997Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:39.0845415Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:39.0845754Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:39.0846098Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:39.0846400Z ) 2025-05-07T20:33:39.0846758Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:39.0847221Z def test_silu_mul_quant( 2025-05-07T20:33:39.0847473Z self, 2025-05-07T20:33:39.0847674Z T: int, 2025-05-07T20:33:39.0847993Z D: int, 2025-05-07T20:33:39.0848229Z scale_ub: Optional[float], 2025-05-07T20:33:39.0848510Z contiguous: bool, 2025-05-07T20:33:39.0848762Z compiled: bool, 2025-05-07T20:33:39.0848996Z ) -> None: 2025-05-07T20:33:39.0849218Z torch.manual_seed(2025) 2025-05-07T20:33:39.0849475Z 2025-05-07T20:33:39.0849767Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:39.0850121Z 2025-05-07T20:33:39.0850327Z x_sign = torch.sign(x) 2025-05-07T20:33:39.0856618Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:39.0856983Z x = x_sign * x_clamp 2025-05-07T20:33:39.0857233Z x0 = x[:, :D] 2025-05-07T20:33:39.0857459Z x1 = x[:, D:] 2025-05-07T20:33:39.0857675Z 2025-05-07T20:33:39.0857972Z if contiguous: 2025-05-07T20:33:39.0858217Z x0 = x0.contiguous() 2025-05-07T20:33:39.0858486Z x1 = x1.contiguous() 2025-05-07T20:33:39.0858733Z 2025-05-07T20:33:39.0858937Z if scale_ub is not None: 2025-05-07T20:33:39.0859219Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:39.0859563Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:39.0859985Z ) 2025-05-07T20:33:39.0860222Z else: 2025-05-07T20:33:39.0860515Z scale_ub_tensor = None 2025-05-07T20:33:39.0860779Z 2025-05-07T20:33:39.0861022Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:39.0861355Z op = silu_mul_quant 2025-05-07T20:33:39.0861610Z if compiled: 2025-05-07T20:33:39.0861873Z op = torch.compile(op) 2025-05-07T20:33:39.0862184Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:39.0862472Z 2025-05-07T20:33:39.0862675Z > y_fp8, y_scale = fn() 2025-05-07T20:33:39.0862854Z 2025-05-07T20:33:39.0862963Z moe/activation_test.py:117: 2025-05-07T20:33:39.0863274Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:39.0863624Z moe/activation_test.py:115: in fn 2025-05-07T20:33:39.0863919Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:39.0864643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:39.0865364Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:39.0865920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:39.0866628Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:39.0867309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:39.0867865Z kernel = self.compile( 2025-05-07T20:33:39.0868431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:39.0869119Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:39.0869526Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:39.0869769Z 2025-05-07T20:33:39.0869981Z self = 2025-05-07T20:33:39.0871108Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:39.0872535Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae289abd90>} 2025-05-07T20:33:39.0874001Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:39.0875123Z context = 2025-05-07T20:33:39.0875430Z 2025-05-07T20:33:39.0875605Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:39.0876147Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:39.0876633Z module_map=module_map) 2025-05-07T20:33:39.0877017Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:39.0877383Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:39.0877653Z E ^ 2025-05-07T20:33:39.0878130Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:39.0878605Z 2025-05-07T20:33:39.0879040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:39.0879613Z 2025-05-07T20:33:39.0879726Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:39.0880156Z self=, 2025-05-07T20:33:39.0880572Z T=1, 2025-05-07T20:33:39.0880763Z D=7168, 2025-05-07T20:33:39.0880967Z scale_ub=1200.0, 2025-05-07T20:33:39.0881243Z contiguous=True, 2025-05-07T20:33:39.0881515Z compiled=True, 2025-05-07T20:33:39.0881728Z ) 2025-05-07T20:33:39.0882054Z self = 2025-05-07T20:33:39.0882558Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:39.0882825Z 2025-05-07T20:33:39.0882910Z @given( 2025-05-07T20:33:39.0883147Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:39.0883474Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:39.0883796Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:39.0884136Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:39.0884481Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:39.0884782Z ) 2025-05-07T20:33:39.0885146Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:39.0885600Z def test_silu_mul_quant( 2025-05-07T20:33:39.0885855Z self, 2025-05-07T20:33:39.0886060Z T: int, 2025-05-07T20:33:39.0886266Z D: int, 2025-05-07T20:33:39.0886493Z scale_ub: Optional[float], 2025-05-07T20:33:39.0886777Z contiguous: bool, 2025-05-07T20:33:39.0887023Z compiled: bool, 2025-05-07T20:33:39.0887256Z ) -> None: 2025-05-07T20:33:39.0887480Z torch.manual_seed(2025) 2025-05-07T20:33:39.0887727Z 2025-05-07T20:33:39.0888011Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:39.0888367Z 2025-05-07T20:33:39.0888564Z x_sign = torch.sign(x) 2025-05-07T20:33:39.0888866Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:39.0889185Z x = x_sign * x_clamp 2025-05-07T20:33:39.0889464Z x0 = x[:, :D] 2025-05-07T20:33:39.0889709Z x1 = x[:, D:] 2025-05-07T20:33:39.0889925Z 2025-05-07T20:33:39.0890118Z if contiguous: 2025-05-07T20:33:39.0890354Z x0 = x0.contiguous() 2025-05-07T20:33:39.0890625Z x1 = x1.contiguous() 2025-05-07T20:33:39.0890875Z 2025-05-07T20:33:39.0891073Z if scale_ub is not None: 2025-05-07T20:33:39.0891362Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:39.0891708Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:39.0892026Z ) 2025-05-07T20:33:39.0892226Z else: 2025-05-07T20:33:39.0892447Z scale_ub_tensor = None 2025-05-07T20:33:39.0892708Z 2025-05-07T20:33:39.0892950Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:39.0893279Z op = silu_mul_quant 2025-05-07T20:33:39.0893535Z if compiled: 2025-05-07T20:33:39.0893791Z op = torch.compile(op) 2025-05-07T20:33:39.0894148Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:39.0894430Z 2025-05-07T20:33:39.0894635Z > y_fp8, y_scale = fn() 2025-05-07T20:33:39.0894809Z 2025-05-07T20:33:39.0894928Z moe/activation_test.py:117: 2025-05-07T20:33:39.0895239Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:39.0895577Z moe/activation_test.py:115: in fn 2025-05-07T20:33:39.0895870Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:39.0896450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:39.0897037Z return fn(*args, **kwargs) 
2025-05-07T20:33:39.0897715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:39.0898471Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:39.0899030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:39.0899751Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:39.0900478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:39.0901117Z kernel = self.compile( 2025-05-07T20:33:39.0901680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:39.0902357Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:39.0902766Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:39.0903006Z 2025-05-07T20:33:39.0903226Z self = 2025-05-07T20:33:39.0904346Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:39.0905763Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae289ab1c0>} 2025-05-07T20:33:39.0907157Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:39.0908221Z context = 2025-05-07T20:33:39.0908519Z 2025-05-07T20:33:39.0908696Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:39.0909232Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:39.0909725Z module_map=module_map) 2025-05-07T20:33:39.0910105Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:39.0910471Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:39.0910739Z E ^ 2025-05-07T20:33:39.0911220Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:39.0912127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:39.0912766Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Same test body and traceback as the example above: _fbgemm_silu_mul_quant fails to compile with the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
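Every failing example above bottoms out in the same ValueError: Triton refuses to lower the fp8e4nv (FP8 E4M3) dtype for this GPU and offers only fp8e4b15 and fp8e5. Triton enables fp8e4nv only on NVIDIA parts with compute capability 8.9 or newer (Ada/Hopper), so any kernel that casts to torch.float8_e4m3fn fails at compile time on the SM 8.x device running this job. A minimal probe, assuming that 8.9 cutoff, would look like:

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv is Triton's name for FP8 E4M3 (torch.float8_e4m3fn);
        # Triton only compiles it for compute capability 8.9+ (Ada, Hopper).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    print(supports_fp8e4nv())  # False on the GPU that produced this log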
2025-05-07T20:33:39.2332478Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
This example fails one step later: fn() returns, and the reference path raises instead, at moe/activation_test.py:126 in ref_fn, via triton_quantize_fp8_row (fp8_gemm.py:2370) into _kernel_quantize_fp8_row, which the Triton autotuner benchmarks and which hits the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
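For context, ref_fn in the test above pins down the expected semantics: SiLU(x0) * x1 computed in fp32, then row-wise FP8 quantization. A hedged, PyTorch-only restatement (the helper names are hypothetical, and quantize_fp8_row_ref is only an outline of the row-wise scaling that triton_quantize_fp8_row performs, not FBGEMM's actual kernel):

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # SiLU(x0) * x1 in fp32, exactly as ref_fn computes y above.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32

    def quantize_fp8_row_ref(y: torch.Tensor, scale_ub=None):
        # One scale per row, chosen so max(|row|) maps onto the FP8 maximum.
        # Dequantization is y_fp8.to(torch.float32) * y_scale[:, None],
        # matching how the test reconstructs y from the quantized pair.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = fp8_max / row_max
        y_fp8 = (y * scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, 1.0 / scale

Note that the eager .to(torch.float8_e4m3fn) cast works on this GPU; it is only Triton's compiled kernels that reject the dtype, which is why both the op under test and the Triton-based reference quantizer fail here.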
2025-05-07T20:33:39.4972984Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:39.6701380Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:39.6734194Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:39.7772374Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
All four examples repeat the identical test body and traceback: _fbgemm_silu_mul_quant fails to compile at y_fp8, y_scale = fn() with CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
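Because the drawn parameters never affect the outcome (every example dies at kernel-compile time), a suite like this can guard itself on FP8 support rather than erroring once per example. A hedged sketch, not the repository's actual fix:

    import unittest
    import torch

    def _fp8e4nv_supported() -> bool:
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator; placed above the @given/@settings stack on
    # test_silu_mul_quant it skips before Hypothesis draws any examples.
    requires_fp8e4nv = unittest.skipIf(
        not _fp8e4nv_supported(),
        "Triton fp8e4nv requires compute capability 8.9+ (Ada/Hopper)",
    )

Applied as @requires_fp8e4nv outermost, the skip fires before the Hypothesis wrapper runs, so the run reports one skipped test instead of this stream of CompilationErrors.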
2025-05-07T20:33:39.9101903Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:39.9139205Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:39.9177570Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:40.1165749Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Each of these examples fails identically at y_fp8, y_scale = fn(): _fbgemm_silu_mul_quant raises CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
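The ValueError itself names the two FP8 formats this architecture can compile, fp8e4b15 and fp8e5. A hedged sketch of capability-based dtype selection (mapping assumption: Triton's fp8e4nv corresponds to torch.float8_e4m3fn and fp8e5 to torch.float8_e5m2; whether the kernels under test accept an alternate dtype is not shown in this log):

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # E4M3 (fp8e4nv) has more mantissa bits but needs SM 8.9+;
        # E5M2 (fp8e5) trades precision for range and compiles on SM 8.0/8.6.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2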
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:40.1198635Z 
2025-05-07T20:33:40.1199076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
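Every example hypothesis generates here fails the same way: Triton only lowers the fp8e4nv (FP8 E4M3) dtype on NVIDIA GPUs with compute capability 8.9 or newer, and the A10G on this linux.g5.4xlarge runner reports capability (8, 6), where only fp8e4b15 and fp8e5 are available. A minimal guard sketch follows; the helper name, the test-class name, and the skipIf wiring are illustrative assumptions, not FBGEMM's actual test code.

import unittest

import torch

def _supports_fp8e4nv() -> bool:
    # Hypothetical helper: Triton's fp8e4nv (E4M3) codegen needs SM 8.9+
    # (Ada/Hopper); the A10G on this runner is SM 8.6, so this returns False.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
class Fp8ActivationTests(unittest.TestCase):
    ...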
2025-05-07T20:33:40.4480942Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:40.4537314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:40.4538465Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:40.4602562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:40.5871479Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:40.5927301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:40.5928450Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:40.5980025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:40.5981116Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:40.8730560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:40.8731689Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:40.8785654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:40.8786802Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:40.9854461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:40.9855344Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:40.9906839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:41.3314072Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:41.3369595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:41.3370797Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:41.3425960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:41.5381160Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:41.5451368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.5450581Z 2025-05-07T20:33:41.5451368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.5452351Z 2025-05-07T20:33:41.5452543Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.5453324Z self=, 2025-05-07T20:33:41.5454288Z T=2048, 2025-05-07T20:33:41.5454635Z D=5120, 2025-05-07T20:33:41.5454987Z scale_ub=None, 2025-05-07T20:33:41.5455400Z contiguous=False, 2025-05-07T20:33:41.5455807Z compiled=True, 2025-05-07T20:33:41.5456196Z ) 2025-05-07T20:33:41.6567969Z self = 2025-05-07T20:33:41.6569407Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:41.6570122Z 2025-05-07T20:33:41.6570309Z @given( 2025-05-07T20:33:41.6570726Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.6571309Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.6571821Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.6572430Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.6573063Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.6573609Z ) 2025-05-07T20:33:41.6574299Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.6575173Z def test_silu_mul_quant( 2025-05-07T20:33:41.6575629Z self, 2025-05-07T20:33:41.6576007Z T: int, 2025-05-07T20:33:41.6576385Z D: int, 2025-05-07T20:33:41.6576790Z scale_ub: Optional[float], 2025-05-07T20:33:41.6577314Z contiguous: bool, 2025-05-07T20:33:41.6577785Z compiled: bool, 2025-05-07T20:33:41.6578206Z ) -> None: 2025-05-07T20:33:41.6578624Z torch.manual_seed(2025) 2025-05-07T20:33:41.6579086Z 2025-05-07T20:33:41.6579600Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.6580249Z 2025-05-07T20:33:41.6580618Z x_sign = torch.sign(x) 2025-05-07T20:33:41.6581170Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.6581756Z x = x_sign * x_clamp 2025-05-07T20:33:41.6582215Z x0 = x[:, :D] 2025-05-07T20:33:41.6582633Z x1 = x[:, D:] 2025-05-07T20:33:41.6583021Z 2025-05-07T20:33:41.6583374Z if contiguous: 2025-05-07T20:33:41.6583815Z x0 = x0.contiguous() 2025-05-07T20:33:41.6584311Z x1 = x1.contiguous() 2025-05-07T20:33:41.6584775Z 2025-05-07T20:33:41.6585140Z if scale_ub is not None: 2025-05-07T20:33:41.6585651Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.6586292Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.6586890Z ) 2025-05-07T20:33:41.6587249Z else: 2025-05-07T20:33:41.6587650Z scale_ub_tensor = None 2025-05-07T20:33:41.6588134Z 2025-05-07T20:33:41.6588560Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.6589168Z op = silu_mul_quant 2025-05-07T20:33:41.6589642Z if compiled: 2025-05-07T20:33:41.6590108Z op = torch.compile(op) 2025-05-07T20:33:41.6590662Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.6591194Z 2025-05-07T20:33:41.6591558Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.6591877Z 2025-05-07T20:33:41.6592063Z moe/activation_test.py:117: 2025-05-07T20:33:41.6592784Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.6593429Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.6594079Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.6595175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.6596236Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.6597477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.6598783Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.6599813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.6601267Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.6602570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.6603638Z kernel = self.compile( 2025-05-07T20:33:41.6604712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.6606130Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.6606967Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.6607435Z 2025-05-07T20:33:41.6607831Z self = 2025-05-07T20:33:41.6609996Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.6612799Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad074543a0>} 2025-05-07T20:33:41.6615519Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.6617499Z context = 2025-05-07T20:33:41.6618087Z 2025-05-07T20:33:41.6618407Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.6619438Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.6620354Z module_map=module_map) 2025-05-07T20:33:41.6621054Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.6621732Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.6622241Z E ^ 2025-05-07T20:33:41.6623148Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.6624380Z 2025-05-07T20:33:41.6625216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.6626237Z 2025-05-07T20:33:41.6626453Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.6627263Z self=, 2025-05-07T20:33:41.6628047Z T=2048, 2025-05-07T20:33:41.6628406Z D=5120, 2025-05-07T20:33:41.6628772Z scale_ub=1200.0, 2025-05-07T20:33:41.6629202Z contiguous=False, 2025-05-07T20:33:41.6629630Z compiled=True, 2025-05-07T20:33:41.6630022Z ) 2025-05-07T20:33:41.6630631Z self = 2025-05-07T20:33:41.6631605Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:41.6632157Z 2025-05-07T20:33:41.6632317Z @given( 2025-05-07T20:33:41.6632749Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.6633488Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.6634201Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.6634848Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.6635486Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.6636055Z ) 2025-05-07T20:33:41.6636743Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.6637609Z def test_silu_mul_quant( 2025-05-07T20:33:41.6638053Z self, 2025-05-07T20:33:41.6638379Z T: int, 2025-05-07T20:33:41.6638678Z D: int, 2025-05-07T20:33:41.6639021Z scale_ub: Optional[float], 2025-05-07T20:33:41.6639448Z contiguous: bool, 2025-05-07T20:33:41.6639815Z compiled: bool, 2025-05-07T20:33:41.6640310Z ) -> None: 2025-05-07T20:33:41.6640654Z torch.manual_seed(2025) 2025-05-07T20:33:41.6641032Z 2025-05-07T20:33:41.6641463Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.6641993Z 2025-05-07T20:33:41.6642288Z x_sign = torch.sign(x) 2025-05-07T20:33:41.6642752Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.6643431Z x = x_sign * x_clamp 2025-05-07T20:33:41.6643838Z x0 = x[:, :D] 2025-05-07T20:33:41.6644304Z x1 = x[:, D:] 2025-05-07T20:33:41.6644675Z 2025-05-07T20:33:41.6644994Z if contiguous: 2025-05-07T20:33:41.6645389Z x0 = x0.contiguous() 2025-05-07T20:33:41.6645839Z x1 = x1.contiguous() 2025-05-07T20:33:41.6646247Z 2025-05-07T20:33:41.6646557Z if scale_ub is not None: 2025-05-07T20:33:41.6647006Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.6647571Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.6648100Z ) 2025-05-07T20:33:41.6648447Z else: 2025-05-07T20:33:41.6648813Z scale_ub_tensor = None 2025-05-07T20:33:41.6649233Z 2025-05-07T20:33:41.6649640Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.6650175Z op = silu_mul_quant 2025-05-07T20:33:41.6650608Z if compiled: 2025-05-07T20:33:41.6651080Z op = torch.compile(op) 2025-05-07T20:33:41.6651589Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.6652074Z 2025-05-07T20:33:41.6652423Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.6652731Z 2025-05-07T20:33:41.6652914Z moe/activation_test.py:117: 2025-05-07T20:33:41.6653456Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.6654059Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.6654576Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.6655606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.6656634Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.6657860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.6659135Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.6660121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.6661372Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.6662602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.6663581Z kernel = self.compile( 2025-05-07T20:33:41.6664529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.6665733Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.6666463Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.6666890Z 2025-05-07T20:33:41.6667395Z self = 2025-05-07T20:33:41.6669505Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.6672167Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad07454820>} 2025-05-07T20:33:41.6674819Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.6677005Z context = 2025-05-07T20:33:41.6677574Z 2025-05-07T20:33:41.6677907Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.6678930Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.6679868Z module_map=module_map) 2025-05-07T20:33:41.6680734Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.6681463Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.6681972Z E ^ 2025-05-07T20:33:41.6682896Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.6683808Z 2025-05-07T20:33:41.6684647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.6685677Z 2025-05-07T20:33:42.0512894Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.0513918Z self=, 2025-05-07T20:33:42.0514680Z T=4096, 2025-05-07T20:33:42.0515029Z D=5120, 2025-05-07T20:33:42.0515395Z scale_ub=1200.0, 2025-05-07T20:33:42.0515816Z contiguous=True, 2025-05-07T20:33:42.0516210Z compiled=True, 2025-05-07T20:33:42.0516594Z ) 2025-05-07T20:33:42.0517181Z self = 2025-05-07T20:33:42.0518123Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:42.0518625Z 2025-05-07T20:33:42.0518757Z @given( 2025-05-07T20:33:42.0519126Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.0519637Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.0520176Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.0520875Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.0521556Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.0522056Z ) 2025-05-07T20:33:42.0522681Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.0523458Z def test_silu_mul_quant( 2025-05-07T20:33:42.0524106Z self, 2025-05-07T20:33:42.0524448Z T: int, 2025-05-07T20:33:42.0524786Z D: int, 2025-05-07T20:33:42.0525144Z scale_ub: Optional[float], 2025-05-07T20:33:42.0525635Z contiguous: bool, 2025-05-07T20:33:42.0526060Z compiled: bool, 2025-05-07T20:33:42.0526449Z ) -> None: 2025-05-07T20:33:42.0526829Z torch.manual_seed(2025) 2025-05-07T20:33:42.0527262Z 2025-05-07T20:33:42.0527782Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.0528411Z 2025-05-07T20:33:42.0528760Z x_sign = torch.sign(x) 2025-05-07T20:33:42.0529290Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.0529868Z x = x_sign * x_clamp 2025-05-07T20:33:42.0530331Z x0 = x[:, :D] 2025-05-07T20:33:42.0530731Z x1 = x[:, D:] 2025-05-07T20:33:42.0531112Z 2025-05-07T20:33:42.0531469Z if contiguous: 2025-05-07T20:33:42.0532226Z x0 = x0.contiguous() 2025-05-07T20:33:42.0532728Z x1 = x1.contiguous() 2025-05-07T20:33:42.0533192Z 2025-05-07T20:33:42.0533561Z if scale_ub is not None: 2025-05-07T20:33:42.0534077Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.0534720Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.0535312Z ) 2025-05-07T20:33:42.0535673Z else: 2025-05-07T20:33:42.0536073Z scale_ub_tensor = None 2025-05-07T20:33:42.0536557Z 2025-05-07T20:33:42.0536992Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.0537605Z op = silu_mul_quant 2025-05-07T20:33:42.0538085Z if compiled: 2025-05-07T20:33:42.0538561Z op = torch.compile(op) 2025-05-07T20:33:42.0539286Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.0539817Z 2025-05-07T20:33:42.0540186Z > y_fp8, y_scale = fn() 2025-05-07T20:33:42.0540504Z 2025-05-07T20:33:42.0540698Z moe/activation_test.py:117: 2025-05-07T20:33:42.0541263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.0541906Z moe/activation_test.py:115: in fn 2025-05-07T20:33:42.0542561Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.0543743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:42.0544823Z return fn(*args, **kwargs) 
2025-05-07T20:33:42.0546105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:42.0547457Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:42.0548503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.0549856Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.0551162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.0552230Z kernel = self.compile( 2025-05-07T20:33:42.0553252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.0554590Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.0555316Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.0555712Z 2025-05-07T20:33:42.0556097Z self = 2025-05-07T20:33:42.0558145Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.0560829Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad07455360>} 2025-05-07T20:33:42.0563454Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.0565466Z context = 2025-05-07T20:33:42.0566029Z 2025-05-07T20:33:42.0566345Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.0567337Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.0568232Z module_map=module_map) 2025-05-07T20:33:42.0568921Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.0569596Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:42.0570092Z E ^ 2025-05-07T20:33:42.0571085Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.0571979Z 2025-05-07T20:33:42.0572791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:42.0573787Z 2025-05-07T20:33:42.0574000Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.0574786Z self=, 2025-05-07T20:33:42.0575542Z T=128, 2025-05-07T20:33:42.0575899Z D=5120, 2025-05-07T20:33:42.0576261Z scale_ub=1200.0, 2025-05-07T20:33:42.0576675Z contiguous=False, 2025-05-07T20:33:42.0577101Z compiled=True, 2025-05-07T20:33:42.0577487Z ) 2025-05-07T20:33:42.1818258Z self = 2025-05-07T20:33:42.1819130Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:42.1819585Z 2025-05-07T20:33:42.1819691Z @given( 2025-05-07T20:33:42.1819966Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.1820604Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.1821219Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.1822038Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.1822813Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.1823395Z ) 2025-05-07T20:33:42.1824423Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.1825310Z def test_silu_mul_quant( 2025-05-07T20:33:42.1825796Z self, 2025-05-07T20:33:42.1826182Z T: int, 2025-05-07T20:33:42.1826579Z D: int, 2025-05-07T20:33:42.1827015Z scale_ub: Optional[float], 2025-05-07T20:33:42.1827552Z contiguous: bool, 2025-05-07T20:33:42.1828047Z compiled: bool, 2025-05-07T20:33:42.1828500Z ) -> None: 2025-05-07T20:33:42.1828935Z torch.manual_seed(2025) 2025-05-07T20:33:42.1829416Z 2025-05-07T20:33:42.1829968Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.1830654Z 2025-05-07T20:33:42.1830954Z x_sign = torch.sign(x) 2025-05-07T20:33:42.1831335Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.1831710Z x = x_sign * x_clamp 2025-05-07T20:33:42.1831990Z x0 = x[:, :D] 2025-05-07T20:33:42.1832245Z x1 = x[:, D:] 2025-05-07T20:33:42.1832486Z 2025-05-07T20:33:42.1832698Z if contiguous: 2025-05-07T20:33:42.1832970Z x0 = x0.contiguous() 2025-05-07T20:33:42.1833273Z x1 = x1.contiguous() 2025-05-07T20:33:42.1833609Z 2025-05-07T20:33:42.1833837Z if scale_ub is not None: 2025-05-07T20:33:42.1834157Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.1834543Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.1834905Z ) 2025-05-07T20:33:42.1835134Z else: 2025-05-07T20:33:42.1835383Z scale_ub_tensor = None 2025-05-07T20:33:42.1835679Z 2025-05-07T20:33:42.1835956Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.1836321Z op = silu_mul_quant 2025-05-07T20:33:42.1836612Z if compiled: 2025-05-07T20:33:42.1836904Z op = torch.compile(op) 2025-05-07T20:33:42.1837253Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.1837574Z 2025-05-07T20:33:42.1837831Z > y_fp8, y_scale = fn() 2025-05-07T20:33:42.1838029Z 2025-05-07T20:33:42.1838146Z moe/activation_test.py:117: 2025-05-07T20:33:42.1838494Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.1838878Z moe/activation_test.py:115: in fn 2025-05-07T20:33:42.1839212Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.1839871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:42.1840614Z return fn(*args, **kwargs) 
2025-05-07T20:33:42.1841378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:42.1842178Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:42.1842807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.1843586Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.1844351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.1844968Z kernel = self.compile( 2025-05-07T20:33:42.1845598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.1846422Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.1846889Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.1847156Z 2025-05-07T20:33:42.1847403Z self = 2025-05-07T20:33:42.1848713Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.1850367Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad07456290>} 2025-05-07T20:33:42.1851963Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.1853147Z context = 2025-05-07T20:33:42.1853481Z 2025-05-07T20:33:42.1853685Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.1854282Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.1854832Z module_map=module_map) 2025-05-07T20:33:42.1855259Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.1855670Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:42.1855965Z E ^ 2025-05-07T20:33:42.1856506Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.1857028Z 2025-05-07T20:33:42.1857515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:42.1858106Z 2025-05-07T20:33:42.1858226Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.1858707Z self=, 2025-05-07T20:33:42.1859172Z T=16384, 2025-05-07T20:33:42.1859399Z D=7168, 2025-05-07T20:33:42.1859636Z scale_ub=1200.0, 2025-05-07T20:33:42.1859890Z contiguous=True, 2025-05-07T20:33:42.1860147Z compiled=True, 2025-05-07T20:33:42.1866836Z ) 2025-05-07T20:33:42.1867243Z self = 2025-05-07T20:33:42.1867824Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:42.1868145Z 2025-05-07T20:33:42.1868247Z @given( 2025-05-07T20:33:42.1868519Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.1868890Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.1869250Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.1869629Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.1870013Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.1870352Z ) 2025-05-07T20:33:42.1870836Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.1871402Z def test_silu_mul_quant( 2025-05-07T20:33:42.1871690Z self, 2025-05-07T20:33:42.1871915Z T: int, 2025-05-07T20:33:42.1872154Z D: int, 2025-05-07T20:33:42.1872419Z scale_ub: Optional[float], 2025-05-07T20:33:42.1872744Z contiguous: bool, 2025-05-07T20:33:42.1873025Z compiled: bool, 2025-05-07T20:33:42.1873290Z ) -> None: 2025-05-07T20:33:42.1873620Z torch.manual_seed(2025) 2025-05-07T20:33:42.1873901Z 2025-05-07T20:33:42.1874221Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.1874618Z 2025-05-07T20:33:42.1874841Z x_sign = torch.sign(x) 2025-05-07T20:33:42.1875184Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.1875607Z x = x_sign * x_clamp 2025-05-07T20:33:42.1875885Z x0 = x[:, :D] 2025-05-07T20:33:42.1876144Z x1 = x[:, D:] 2025-05-07T20:33:42.1876389Z 2025-05-07T20:33:42.1876611Z if contiguous: 2025-05-07T20:33:42.1876884Z x0 = x0.contiguous() 2025-05-07T20:33:42.1877186Z x1 = x1.contiguous() 2025-05-07T20:33:42.1877458Z 2025-05-07T20:33:42.1877743Z if scale_ub is not None: 2025-05-07T20:33:42.1878112Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.1878502Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.1878865Z ) 2025-05-07T20:33:42.1879094Z else: 2025-05-07T20:33:42.1879342Z scale_ub_tensor = None 2025-05-07T20:33:42.1879631Z 2025-05-07T20:33:42.1879908Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.1880283Z op = silu_mul_quant 2025-05-07T20:33:42.1880570Z if compiled: 2025-05-07T20:33:42.1880865Z op = torch.compile(op) 2025-05-07T20:33:42.1881218Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.1881533Z 2025-05-07T20:33:42.1881763Z > y_fp8, y_scale = fn() 2025-05-07T20:33:42.1881953Z 2025-05-07T20:33:42.1882071Z moe/activation_test.py:117: 2025-05-07T20:33:42.1882420Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.1882808Z moe/activation_test.py:115: in fn 2025-05-07T20:33:42.1883136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.1883781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:42.1884428Z return fn(*args, **kwargs) 
2025-05-07T20:33:42.1885188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:42.1885972Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:42.1886591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.1887374Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.1888133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.1888740Z kernel = self.compile( 2025-05-07T20:33:42.1889370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.1890123Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.1890580Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.1890851Z 2025-05-07T20:33:42.1891090Z self = 2025-05-07T20:33:42.1892330Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.1893957Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad07456d40>} 2025-05-07T20:33:42.1895490Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.1896655Z context = 2025-05-07T20:33:42.1896994Z 2025-05-07T20:33:42.1897186Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.1897788Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.1898329Z module_map=module_map) 2025-05-07T20:33:42.1898810Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.1899216Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:42.1899515Z E ^ 2025-05-07T20:33:42.1900047Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.1900568Z 2025-05-07T20:33:42.1901125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:42.1901766Z 2025-05-07T20:33:42.3390760Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.3392249Z self=, 2025-05-07T20:33:42.3393141Z T=16384, 2025-05-07T20:33:42.3393654Z D=5120, 2025-05-07T20:33:42.3394027Z scale_ub=1200.0, 2025-05-07T20:33:42.3394435Z contiguous=True, 2025-05-07T20:33:42.3394847Z compiled=False, 2025-05-07T20:33:42.3395200Z ) 2025-05-07T20:33:42.3395823Z self = 2025-05-07T20:33:42.3396797Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:42.3397368Z 2025-05-07T20:33:42.3397527Z @given( 2025-05-07T20:33:42.3397964Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.3398582Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.3399198Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.3399845Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.3400493Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.3401057Z ) 2025-05-07T20:33:42.3401737Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.3402634Z def test_silu_mul_quant( 2025-05-07T20:33:42.3403112Z self, 2025-05-07T20:33:42.3403498Z T: int, 2025-05-07T20:33:42.3403871Z D: int, 2025-05-07T20:33:42.3404295Z scale_ub: Optional[float], 2025-05-07T20:33:42.3404829Z contiguous: bool, 2025-05-07T20:33:42.3405285Z compiled: bool, 2025-05-07T20:33:42.3405723Z ) -> None: 2025-05-07T20:33:42.3406147Z torch.manual_seed(2025) 2025-05-07T20:33:42.3406612Z 2025-05-07T20:33:42.3407143Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.3407816Z 2025-05-07T20:33:42.3408181Z x_sign = torch.sign(x) 2025-05-07T20:33:42.3408748Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.3409359Z x = x_sign * x_clamp 2025-05-07T20:33:42.3409818Z x0 = x[:, :D] 2025-05-07T20:33:42.3410256Z x1 = x[:, D:] 2025-05-07T20:33:42.3410667Z 2025-05-07T20:33:42.3411012Z if contiguous: 2025-05-07T20:33:42.3411457Z x0 = x0.contiguous() 2025-05-07T20:33:42.3411959Z x1 = x1.contiguous() 2025-05-07T20:33:42.3412426Z 2025-05-07T20:33:42.3412784Z if scale_ub is not None: 2025-05-07T20:33:42.3413318Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.3413955Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.3414545Z ) 2025-05-07T20:33:42.3415247Z else: 2025-05-07T20:33:42.3415670Z scale_ub_tensor = None 2025-05-07T20:33:42.3416150Z 2025-05-07T20:33:42.3416589Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.3417205Z op = silu_mul_quant 2025-05-07T20:33:42.3417680Z if compiled: 2025-05-07T20:33:42.3418155Z op = torch.compile(op) 2025-05-07T20:33:42.3418732Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.3419231Z 2025-05-07T20:33:42.3419589Z > y_fp8, y_scale = fn() 2025-05-07T20:33:42.3419900Z 2025-05-07T20:33:42.3420092Z moe/activation_test.py:117: 2025-05-07T20:33:42.3420652Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.3421286Z moe/activation_test.py:115: in fn 2025-05-07T20:33:42.3421974Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.3423305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:42.3424882Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:42.3425914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.3427555Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.3428905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.3429966Z kernel = self.compile( 2025-05-07T20:33:42.3431045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.3432355Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.3433136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.3433709Z 2025-05-07T20:33:42.3434118Z self = 2025-05-07T20:33:42.3436305Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.3439097Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad07457ac0>} 2025-05-07T20:33:42.3441832Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.3443885Z context = 2025-05-07T20:33:42.3444474Z 2025-05-07T20:33:42.3444795Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.3445840Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.3446771Z module_map=module_map) 2025-05-07T20:33:42.3447481Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.3448173Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:42.3448680Z E ^ 2025-05-07T20:33:42.3449598Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.3450569Z 2025-05-07T20:33:42.3451400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:42.3452446Z 2025-05-07T20:33:42.3452645Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.3453457Z self=, 2025-05-07T20:33:42.3454250Z T=1, 2025-05-07T20:33:42.3454600Z D=7168, 2025-05-07T20:33:42.3454974Z scale_ub=1200.0, 2025-05-07T20:33:42.3455523Z contiguous=False, 2025-05-07T20:33:42.3455966Z compiled=False, 2025-05-07T20:33:42.3456369Z ) 2025-05-07T20:33:42.3456976Z self = 2025-05-07T20:33:42.3457948Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:42.3458483Z 2025-05-07T20:33:42.3458639Z @given( 2025-05-07T20:33:42.3459069Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.3459681Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.3460273Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.3460922Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.3461546Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.3462193Z ) 2025-05-07T20:33:42.3462751Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.3463437Z def test_silu_mul_quant( 2025-05-07T20:33:42.3463831Z self, 2025-05-07T20:33:42.3464151Z T: int, 2025-05-07T20:33:42.3464464Z D: int, 2025-05-07T20:33:42.3464822Z scale_ub: Optional[float], 2025-05-07T20:33:42.3465256Z contiguous: bool, 2025-05-07T20:33:42.3465733Z compiled: bool, 2025-05-07T20:33:42.3466154Z ) -> None: 2025-05-07T20:33:42.3466537Z torch.manual_seed(2025) 2025-05-07T20:33:42.3466954Z 2025-05-07T20:33:42.3467426Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.3467995Z 2025-05-07T20:33:42.3468309Z x_sign = torch.sign(x) 2025-05-07T20:33:42.3468808Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.3469345Z x = x_sign * x_clamp 2025-05-07T20:33:42.3469763Z x0 = x[:, :D] 2025-05-07T20:33:42.3470132Z x1 = x[:, D:] 2025-05-07T20:33:42.3470496Z 2025-05-07T20:33:42.3470820Z if contiguous: 2025-05-07T20:33:42.3471216Z x0 = x0.contiguous() 2025-05-07T20:33:42.3471676Z x1 = x1.contiguous() 2025-05-07T20:33:42.3472093Z 2025-05-07T20:33:42.3472412Z if scale_ub is not None: 2025-05-07T20:33:42.3472871Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.3473468Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.3474086Z ) 2025-05-07T20:33:42.3474433Z else: 2025-05-07T20:33:42.3474806Z scale_ub_tensor = None 2025-05-07T20:33:42.3475282Z 2025-05-07T20:33:42.3475717Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.3476307Z op = silu_mul_quant 2025-05-07T20:33:42.3476769Z if compiled: 2025-05-07T20:33:42.3477228Z op = torch.compile(op) 2025-05-07T20:33:42.3477779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.3478300Z 2025-05-07T20:33:42.3478651Z > y_fp8, y_scale = fn() 2025-05-07T20:33:42.3478967Z 2025-05-07T20:33:42.3479152Z moe/activation_test.py:117: 2025-05-07T20:33:42.3479709Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.3480317Z moe/activation_test.py:115: in fn 2025-05-07T20:33:42.3480892Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.3482191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:42.3483479Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:42.3484476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.3485751Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.3486989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.3487982Z kernel = self.compile( 2025-05-07T20:33:42.3489136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.3490387Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.3491084Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.3491498Z 2025-05-07T20:33:42.3491883Z self = 2025-05-07T20:33:42.3493876Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.3496582Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06fa8550>} 2025-05-07T20:33:42.3499271Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.3501262Z context = 2025-05-07T20:33:42.3501843Z 2025-05-07T20:33:42.3502249Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.3503345Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.3504284Z module_map=module_map) 2025-05-07T20:33:42.3504980Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.3505674Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:42.3506185Z E ^ 2025-05-07T20:33:42.3507096Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.3508020Z 2025-05-07T20:33:42.3508853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:42.3509895Z 2025-05-07T20:33:42.5601845Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.5602741Z self=, 2025-05-07T20:33:42.5603541Z T=4096, 2025-05-07T20:33:42.5603896Z D=7168, 2025-05-07T20:33:42.5604270Z scale_ub=1200.0, 2025-05-07T20:33:42.5604689Z contiguous=False, 2025-05-07T20:33:42.5605121Z compiled=True, 2025-05-07T20:33:42.5605524Z ) 2025-05-07T20:33:42.5606129Z self = 2025-05-07T20:33:42.5607066Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:42.5607539Z 2025-05-07T20:33:42.5607670Z @given( 2025-05-07T20:33:42.5608055Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.5608589Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.5609108Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.5609690Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.5610290Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.5610835Z ) 2025-05-07T20:33:42.5611459Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.5612227Z def test_silu_mul_quant( 2025-05-07T20:33:42.5612654Z self, 2025-05-07T20:33:42.5612998Z T: int, 2025-05-07T20:33:42.5613361Z D: int, 2025-05-07T20:33:42.5613760Z scale_ub: Optional[float], 2025-05-07T20:33:42.5614257Z contiguous: bool, 2025-05-07T20:33:42.5614697Z compiled: bool, 2025-05-07T20:33:42.5615105Z ) -> None: 2025-05-07T20:33:42.5615499Z torch.manual_seed(2025) 2025-05-07T20:33:42.5615975Z 2025-05-07T20:33:42.5616490Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.5617127Z 2025-05-07T20:33:42.5617486Z x_sign = torch.sign(x) 2025-05-07T20:33:42.5618379Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.5618973Z x = x_sign * x_clamp 2025-05-07T20:33:42.5619430Z x0 = x[:, :D] 2025-05-07T20:33:42.5619846Z x1 = x[:, D:] 2025-05-07T20:33:42.5620237Z 2025-05-07T20:33:42.5620645Z if contiguous: 2025-05-07T20:33:42.5621128Z x0 = x0.contiguous() 2025-05-07T20:33:42.5621628Z x1 = x1.contiguous() 2025-05-07T20:33:42.5622107Z 2025-05-07T20:33:42.5622490Z if scale_ub is not None: 2025-05-07T20:33:42.5623015Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.5623668Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.5624685Z ) 2025-05-07T20:33:42.5625065Z else: 2025-05-07T20:33:42.5625469Z scale_ub_tensor = None 2025-05-07T20:33:42.5626150Z 2025-05-07T20:33:42.5626611Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.5627233Z op = silu_mul_quant 2025-05-07T20:33:42.5627743Z if compiled: 2025-05-07T20:33:42.5628224Z op = torch.compile(op) 2025-05-07T20:33:42.5628805Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.5629355Z 2025-05-07T20:33:42.5629727Z > y_fp8, y_scale = fn() 2025-05-07T20:33:42.5630208Z 2025-05-07T20:33:42.5630513Z moe/activation_test.py:117: 2025-05-07T20:33:42.5631102Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.5631777Z moe/activation_test.py:115: in fn 2025-05-07T20:33:42.5632330Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.5633420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:42.5634646Z return fn(*args, **kwargs) 
2025-05-07T20:33:42.5635983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:42.5637379Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:42.5638471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.5639845Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.5641225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.5642259Z kernel = self.compile( 2025-05-07T20:33:42.5643322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.5644546Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.5645328Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.5645763Z 2025-05-07T20:33:42.5646160Z self = 2025-05-07T20:33:42.5648277Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.5651055Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06fa8f70>} 2025-05-07T20:33:42.5653759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.5655776Z context = 2025-05-07T20:33:42.5656331Z 2025-05-07T20:33:42.5656639Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.5657669Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.5658738Z module_map=module_map) 2025-05-07T20:33:42.5659445Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.5660128Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:42.5660638Z E ^ 2025-05-07T20:33:42.5661562Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.5662478Z 2025-05-07T20:33:42.5663290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:42.5664322Z 2025-05-07T20:33:42.5664525Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.5665334Z self=, 2025-05-07T20:33:42.5666113Z T=128, 2025-05-07T20:33:42.5666574Z D=7168, 2025-05-07T20:33:42.5666952Z scale_ub=1200.0, 2025-05-07T20:33:42.5667372Z contiguous=False, 2025-05-07T20:33:42.5667813Z compiled=True, 2025-05-07T20:33:42.5668225Z ) 2025-05-07T20:33:42.6808813Z self = 2025-05-07T20:33:42.6809931Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:42.6810843Z 2025-05-07T20:33:42.6810999Z @given( 2025-05-07T20:33:42.6811550Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.6812135Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.6812657Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.6813280Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.6813927Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.6814501Z ) 2025-05-07T20:33:42.6815184Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.6816074Z def test_silu_mul_quant( 2025-05-07T20:33:42.6816545Z self, 2025-05-07T20:33:42.6816913Z T: int, 2025-05-07T20:33:42.6817297Z D: int, 2025-05-07T20:33:42.6817722Z scale_ub: Optional[float], 2025-05-07T20:33:42.6818246Z contiguous: bool, 2025-05-07T20:33:42.6818713Z compiled: bool, 2025-05-07T20:33:42.6819151Z ) -> None: 2025-05-07T20:33:42.6819566Z torch.manual_seed(2025) 2025-05-07T20:33:42.6820051Z 2025-05-07T20:33:42.6820622Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.6821330Z 2025-05-07T20:33:42.6821701Z x_sign = torch.sign(x) 2025-05-07T20:33:42.6822268Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.6822881Z x = x_sign * x_clamp 2025-05-07T20:33:42.6823337Z x0 = x[:, :D] 2025-05-07T20:33:42.6824004Z x1 = x[:, D:] 2025-05-07T20:33:42.6824419Z 2025-05-07T20:33:42.6824768Z if contiguous: 2025-05-07T20:33:42.6825228Z x0 = x0.contiguous() 2025-05-07T20:33:42.6825739Z x1 = x1.contiguous() 2025-05-07T20:33:42.6826201Z 2025-05-07T20:33:42.6826581Z if scale_ub is not None: 2025-05-07T20:33:42.6827120Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.6827771Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.6828380Z ) 2025-05-07T20:33:42.6828760Z else: 2025-05-07T20:33:42.6829164Z scale_ub_tensor = None 2025-05-07T20:33:42.6829662Z 2025-05-07T20:33:42.6830108Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.6830805Z op = silu_mul_quant 2025-05-07T20:33:42.6831295Z if compiled: 2025-05-07T20:33:42.6843067Z op = torch.compile(op) 2025-05-07T20:33:42.6843722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.6844284Z 2025-05-07T20:33:42.6844659Z > y_fp8, y_scale = fn() 2025-05-07T20:33:42.6845017Z 2025-05-07T20:33:42.6845216Z moe/activation_test.py:117: 2025-05-07T20:33:42.6850122Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.6850964Z moe/activation_test.py:115: in fn 2025-05-07T20:33:42.6851539Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.6852691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:42.6853840Z return fn(*args, **kwargs) 
2025-05-07T20:33:42.6855204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:42.6856629Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:42.6857667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.6859031Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.6860389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.6861555Z kernel = self.compile( 2025-05-07T20:33:42.6862655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.6863989Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.6864919Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.6865488Z 2025-05-07T20:33:42.6865895Z self = 2025-05-07T20:33:42.6868097Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.6871024Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06fa92d0>} 2025-05-07T20:33:42.6873957Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.6876088Z context = 2025-05-07T20:33:42.6876694Z 2025-05-07T20:33:42.6877029Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.6878090Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.6879033Z module_map=module_map) 2025-05-07T20:33:42.6879707Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.6880268Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:42.6880717Z E ^ 2025-05-07T20:33:42.6881451Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.6882194Z 2025-05-07T20:33:42.6882866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:42.6883697Z 2025-05-07T20:33:42.6883889Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.6884593Z self=, 2025-05-07T20:33:42.6885283Z T=2048, 2025-05-07T20:33:42.6885626Z D=7168, 2025-05-07T20:33:42.6885955Z scale_ub=None, 2025-05-07T20:33:42.6886315Z contiguous=True, 2025-05-07T20:33:42.6886726Z compiled=True, 2025-05-07T20:33:42.6887093Z ) 2025-05-07T20:33:42.6887651Z self = 2025-05-07T20:33:42.6888491Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:42.6888954Z 2025-05-07T20:33:42.6889114Z @given( 2025-05-07T20:33:42.6889522Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.6890255Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.6890872Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.6891473Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.6892029Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.6892498Z ) 2025-05-07T20:33:42.6893101Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.6893797Z def test_silu_mul_quant( 2025-05-07T20:33:42.6894207Z self, 2025-05-07T20:33:42.6894534Z T: int, 2025-05-07T20:33:42.6894856Z D: int, 2025-05-07T20:33:42.6895222Z scale_ub: Optional[float], 2025-05-07T20:33:42.6895687Z contiguous: bool, 2025-05-07T20:33:42.6896092Z compiled: bool, 2025-05-07T20:33:42.6896481Z ) -> None: 2025-05-07T20:33:42.6896836Z torch.manual_seed(2025) 2025-05-07T20:33:42.6897240Z 2025-05-07T20:33:42.6897680Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.6898245Z 2025-05-07T20:33:42.6898570Z x_sign = torch.sign(x) 2025-05-07T20:33:42.6899043Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.6899554Z x = x_sign * x_clamp 2025-05-07T20:33:42.6899954Z x0 = x[:, :D] 2025-05-07T20:33:42.6900395Z x1 = x[:, D:] 2025-05-07T20:33:42.6900789Z 2025-05-07T20:33:42.6901154Z if contiguous: 2025-05-07T20:33:42.6901532Z x0 = x0.contiguous() 2025-05-07T20:33:42.6901962Z x1 = x1.contiguous() 2025-05-07T20:33:42.6902366Z 2025-05-07T20:33:42.6902683Z if scale_ub is not None: 2025-05-07T20:33:42.6903138Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.6903683Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.6904215Z ) 2025-05-07T20:33:42.6904551Z else: 2025-05-07T20:33:42.6904909Z scale_ub_tensor = None 2025-05-07T20:33:42.6905337Z 2025-05-07T20:33:42.6905715Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.6906245Z op = silu_mul_quant 2025-05-07T20:33:42.6906657Z if compiled: 2025-05-07T20:33:42.6907064Z op = torch.compile(op) 2025-05-07T20:33:42.6907557Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.6908013Z 2025-05-07T20:33:42.6908323Z > y_fp8, y_scale = fn() 2025-05-07T20:33:42.6908605Z 2025-05-07T20:33:42.6908768Z moe/activation_test.py:117: 2025-05-07T20:33:42.6909252Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.6909791Z moe/activation_test.py:115: in fn 2025-05-07T20:33:42.6910251Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.6911193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:42.6912119Z return fn(*args, **kwargs) 
2025-05-07T20:33:42.6913199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:42.6914447Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:42.6915329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.6916446Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.6917553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.6918435Z kernel = self.compile( 2025-05-07T20:33:42.6919326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.6920406Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.6921072Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.6921451Z 2025-05-07T20:33:42.6921802Z self = 2025-05-07T20:33:42.6924089Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.6926409Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06faa560>} 2025-05-07T20:33:42.6928643Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.6930330Z context = 2025-05-07T20:33:42.6930855Z 2025-05-07T20:33:42.6931142Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.6932006Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.6932787Z module_map=module_map) 2025-05-07T20:33:42.6933398Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.6934003Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:42.6934577Z E ^ 2025-05-07T20:33:42.6935436Z E ValueError("type fp8e4nv not supported in this architecture. 
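(Note: every example above failed identically, regardless of T, D, scale_ub, contiguity, or torch.compile. A plausible reading, not confirmed by this log alone: the job ran on a g5.4xlarge, whose NVIDIA A10G is compute capability 8.6, while Triton only lowers fp8e4nv (torch.float8_e4m3fn) on compute capability 8.9 and newer; on SM 8.6 it offers only fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal sketch of a capability guard for such tests, assuming a pytest-style suite; the helper and marker names are hypothetical, not part of the test file:

    import pytest
    import torch

    def _has_fp8e4nv() -> bool:
        # fp8e4nv (float8_e4m3fn) needs SM 8.9+ (Ada/Hopper); the A10G is SM 8.6.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical marker; apply as @requires_fp8e4nv on test_silu_mul_quant.
    requires_fp8e4nv = pytest.mark.skipif(
        not _has_fp8e4nv(), reason="fp8e4nv requires compute capability >= 8.9"
    )

With such a guard these examples would be reported as skips on this runner instead of repeated CompilationErrors.)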
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.6936202Z 2025-05-07T20:33:42.6936898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:42.6937740Z 2025-05-07T20:33:42.7794013Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.7794896Z self=, 2025-05-07T20:33:42.7795668Z T=16384, 2025-05-07T20:33:42.7796032Z D=5120, 2025-05-07T20:33:42.7796422Z scale_ub=None, 2025-05-07T20:33:42.7796836Z contiguous=False, 2025-05-07T20:33:42.7797269Z compiled=False, 2025-05-07T20:33:42.7797660Z ) 2025-05-07T20:33:42.7798266Z self = 2025-05-07T20:33:42.7799245Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:42.7799806Z 2025-05-07T20:33:42.7799964Z @given( 2025-05-07T20:33:42.7800407Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.7801001Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.7801593Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.7802233Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.7802873Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.7803422Z ) 2025-05-07T20:33:42.7804114Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.7804992Z def test_silu_mul_quant( 2025-05-07T20:33:42.7805458Z self, 2025-05-07T20:33:42.7805831Z T: int, 2025-05-07T20:33:42.7806217Z D: int, 2025-05-07T20:33:42.7806625Z scale_ub: Optional[float], 2025-05-07T20:33:42.7807152Z contiguous: bool, 2025-05-07T20:33:42.7807624Z compiled: bool, 2025-05-07T20:33:42.7808061Z ) -> None: 2025-05-07T20:33:42.7808487Z torch.manual_seed(2025) 2025-05-07T20:33:42.7808967Z 2025-05-07T20:33:42.7809492Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.7810174Z 2025-05-07T20:33:42.7810564Z x_sign = torch.sign(x) 2025-05-07T20:33:42.7811130Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.7815470Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
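[annotation] The repeated CompilationError above is Triton rejecting the fp8e4nv dtype (PyTorch's float8_e4m3fn) while lowering _fbgemm_silu_mul_quant: the GPUs that report only ('fp8e4b15', 'fp8e5') as supported are pre-Ada NVIDIA parts. A minimal sketch of how such a test could be gated, assuming a compute-capability >= 8.9 cutoff; supports_fp8e4nv is a hypothetical helper, not FBGEMM API:

```python
import torch

def supports_fp8e4nv() -> bool:
    # Assumption: Triton lowers fp8e4nv (float8_e4m3fn) only on NVIDIA parts
    # with compute capability >= 8.9 (Ada/Hopper); older GPUs report exactly
    # the supported set seen in the error, ('fp8e4b15', 'fp8e5').
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Usage sketch: @unittest.skipUnless(supports_fp8e4nv(), "needs fp8e4nv (sm_89+)")
```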
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:42.7819416Z 2025-05-07T20:33:42.7819651Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:42.7820076Z 2025-05-07T20:33:42.7820275Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.7821143Z self=, 2025-05-07T20:33:42.7821926Z T=4096, 2025-05-07T20:33:42.7822292Z D=7168, 2025-05-07T20:33:42.7822665Z scale_ub=1200.0, 2025-05-07T20:33:42.7823089Z contiguous=True, 2025-05-07T20:33:42.7823518Z compiled=True, 2025-05-07T20:33:42.7824277Z ) 2025-05-07T20:33:42.7824885Z self = 2025-05-07T20:33:42.7825852Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:42.7826376Z 2025-05-07T20:33:42.7826523Z @given( 2025-05-07T20:33:42.7826946Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.7827535Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.7828339Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.7829104Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.7829730Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.7830373Z ) 2025-05-07T20:33:42.7831164Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.7832228Z def test_silu_mul_quant( 2025-05-07T20:33:42.7832737Z self, 2025-05-07T20:33:42.7833108Z T: int, 2025-05-07T20:33:42.7833480Z D: int, 2025-05-07T20:33:42.7833978Z scale_ub: Optional[float], 2025-05-07T20:33:42.7834501Z contiguous: bool, 2025-05-07T20:33:42.7834975Z compiled: bool, 2025-05-07T20:33:42.7835388Z ) -> None: 2025-05-07T20:33:42.7835804Z torch.manual_seed(2025) 2025-05-07T20:33:42.7836274Z 2025-05-07T20:33:42.7836790Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.7837464Z 2025-05-07T20:33:42.7837844Z x_sign = torch.sign(x) 2025-05-07T20:33:42.7838390Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.7842392Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:42.7846147Z 2025-05-07T20:33:42.7846386Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:42.7846803Z 2025-05-07T20:33:42.7847001Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.7847804Z self=, 2025-05-07T20:33:42.7848580Z T=16384, 2025-05-07T20:33:42.7848945Z D=7168, 2025-05-07T20:33:42.7849342Z scale_ub=None, 2025-05-07T20:33:42.7849742Z contiguous=False, 2025-05-07T20:33:42.7850178Z compiled=False, 2025-05-07T20:33:42.7850624Z ) 2025-05-07T20:33:42.7851227Z self = 2025-05-07T20:33:42.7852183Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:42.7852721Z 2025-05-07T20:33:42.7852882Z @given( 2025-05-07T20:33:42.7853315Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.7853912Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.7854668Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.7855409Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.7856043Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.7856604Z ) 2025-05-07T20:33:42.7857299Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.7858168Z def test_silu_mul_quant( 2025-05-07T20:33:42.7858640Z self, 2025-05-07T20:33:42.7859001Z T: int, 2025-05-07T20:33:42.7859374Z D: int, 2025-05-07T20:33:42.7859782Z scale_ub: Optional[float], 2025-05-07T20:33:42.7860305Z contiguous: bool, 2025-05-07T20:33:42.7860801Z compiled: bool, 2025-05-07T20:33:42.7861231Z ) -> None: 2025-05-07T20:33:42.7861637Z torch.manual_seed(2025) 2025-05-07T20:33:42.7862101Z 2025-05-07T20:33:42.7862614Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.7866833Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
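[annotation] The sizes the allocator reports correspond exactly to one bfloat16 tensor of shape [T, 2 * D] (2 bytes per element), i.e. the x produced by torch.randn in the test body. A quick check of the figures seen in these failures:

```python
def bf16_tensor_mib(T: int, D: int) -> float:
    # Size in MiB of one [T, 2*D] bfloat16 tensor: T * 2D elements * 2 bytes.
    return T * (2 * D) * 2 / 2**20

assert bf16_tensor_mib(16384, 7168) == 448.0  # the 448.00 MiB request above
assert bf16_tensor_mib(16384, 5120) == 320.0  # the 320.00 MiB request
assert bf16_tensor_mib(4096, 7168) == 112.0   # the 112.00 MiB request
assert bf16_tensor_mib(2048, 7168) == 56.0    # the 56.00 MiB request
```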
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:42.7870879Z 2025-05-07T20:33:42.7871170Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:42.7871687Z 2025-05-07T20:33:42.7871900Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.7872676Z self=, 2025-05-07T20:33:42.7873462Z T=2048, 2025-05-07T20:33:42.7873915Z D=7168, 2025-05-07T20:33:42.7874274Z scale_ub=1200.0, 2025-05-07T20:33:42.7874711Z contiguous=True, 2025-05-07T20:33:42.7875138Z compiled=True, 2025-05-07T20:33:42.7875538Z ) 2025-05-07T20:33:42.7876144Z self = 2025-05-07T20:33:42.7877099Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:42.7877631Z 2025-05-07T20:33:42.7877787Z @given( 2025-05-07T20:33:42.7878217Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.7878828Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.7879423Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.7880052Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.7880715Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.7881296Z ) 2025-05-07T20:33:42.7881968Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.7882833Z def test_silu_mul_quant( 2025-05-07T20:33:42.7883293Z self, 2025-05-07T20:33:42.7883669Z T: int, 2025-05-07T20:33:42.7884040Z D: int, 2025-05-07T20:33:42.7884448Z scale_ub: Optional[float], 2025-05-07T20:33:42.7884972Z contiguous: bool, 2025-05-07T20:33:42.7885428Z compiled: bool, 2025-05-07T20:33:42.7885858Z ) -> None: 2025-05-07T20:33:42.7886277Z torch.manual_seed(2025) 2025-05-07T20:33:42.7886757Z 2025-05-07T20:33:42.7887293Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.7887933Z 2025-05-07T20:33:42.7888268Z x_sign = torch.sign(x) 2025-05-07T20:33:42.7888815Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.7893512Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:42.7897397Z 2025-05-07T20:33:42.7897632Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:42.7898059Z 2025-05-07T20:33:42.7898267Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.7899065Z self=, 2025-05-07T20:33:42.7899860Z T=2048, 2025-05-07T20:33:42.7900220Z D=7168, 2025-05-07T20:33:42.7900594Z scale_ub=None, 2025-05-07T20:33:42.7901040Z contiguous=True, 2025-05-07T20:33:42.7901468Z compiled=False, 2025-05-07T20:33:42.7901855Z ) 2025-05-07T20:33:43.1073406Z self = 2025-05-07T20:33:43.1074549Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:43.1075099Z 2025-05-07T20:33:43.1075264Z @given( 2025-05-07T20:33:43.1075691Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.1076296Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.1076874Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.1078023Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.1078628Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.1079139Z ) 2025-05-07T20:33:43.1079782Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.1080633Z def test_silu_mul_quant( 2025-05-07T20:33:43.1081087Z self, 2025-05-07T20:33:43.1081451Z T: int, 2025-05-07T20:33:43.1081813Z D: int, 2025-05-07T20:33:43.1082223Z scale_ub: Optional[float], 2025-05-07T20:33:43.1082759Z contiguous: bool, 2025-05-07T20:33:43.1083208Z compiled: bool, 2025-05-07T20:33:43.1083638Z ) -> None: 2025-05-07T20:33:43.1084041Z torch.manual_seed(2025) 2025-05-07T20:33:43.1084507Z 2025-05-07T20:33:43.1085015Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.1085668Z 2025-05-07T20:33:43.1086035Z > x_sign = torch.sign(x) 2025-05-07T20:33:43.1089933Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.1093762Z 2025-05-07T20:33:43.1094006Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:43.1094429Z 2025-05-07T20:33:43.1094637Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.1095444Z self=, 2025-05-07T20:33:43.1096241Z T=1, 2025-05-07T20:33:43.1096584Z D=7168, 2025-05-07T20:33:43.1096954Z scale_ub=1200.0, 2025-05-07T20:33:43.1097380Z contiguous=True, 2025-05-07T20:33:43.1097809Z compiled=False, 2025-05-07T20:33:43.1098193Z ) 2025-05-07T20:33:43.1098799Z self = 2025-05-07T20:33:43.1099740Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:43.1100256Z 2025-05-07T20:33:43.1100404Z @given( 2025-05-07T20:33:43.1100840Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.1101462Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.1102060Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.1102896Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.1103659Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.1104223Z ) 2025-05-07T20:33:43.1104918Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.1105791Z def test_silu_mul_quant( 2025-05-07T20:33:43.1106273Z self, 2025-05-07T20:33:43.1106644Z T: int, 2025-05-07T20:33:43.1107027Z D: int, 2025-05-07T20:33:43.1107443Z scale_ub: Optional[float], 2025-05-07T20:33:43.1107953Z contiguous: bool, 2025-05-07T20:33:43.1108418Z compiled: bool, 2025-05-07T20:33:43.1108833Z ) -> None: 2025-05-07T20:33:43.1109226Z torch.manual_seed(2025) 2025-05-07T20:33:43.1109675Z 2025-05-07T20:33:43.1110158Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.1110867Z 2025-05-07T20:33:43.1111225Z x_sign = torch.sign(x) 2025-05-07T20:33:43.1111768Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:43.1112371Z x = x_sign * x_clamp 2025-05-07T20:33:43.1112838Z x0 = x[:, :D] 2025-05-07T20:33:43.1113237Z x1 = x[:, D:] 2025-05-07T20:33:43.1113745Z 2025-05-07T20:33:43.1114097Z if contiguous: 2025-05-07T20:33:43.1114532Z x0 = x0.contiguous() 2025-05-07T20:33:43.1115134Z x1 = x1.contiguous() 2025-05-07T20:33:43.1115704Z 2025-05-07T20:33:43.1116089Z if scale_ub is not None: 2025-05-07T20:33:43.1116611Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:43.1117239Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:43.1117839Z ) 2025-05-07T20:33:43.1118213Z else: 2025-05-07T20:33:43.1118604Z scale_ub_tensor = None 2025-05-07T20:33:43.1119063Z 2025-05-07T20:33:43.1119477Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:43.1120052Z op = silu_mul_quant 2025-05-07T20:33:43.1120543Z if compiled: 2025-05-07T20:33:43.1121021Z op = torch.compile(op) 2025-05-07T20:33:43.1121583Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.1122094Z 2025-05-07T20:33:43.1122453Z > y_fp8, y_scale = fn() 2025-05-07T20:33:43.1122773Z 2025-05-07T20:33:43.1122964Z moe/activation_test.py:117: 2025-05-07T20:33:43.1123546Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.1124423Z moe/activation_test.py:115: in fn 2025-05-07T20:33:43.1124973Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.1126319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:43.1127672Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:43.1128735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:43.1130042Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:43.1131354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:43.1132409Z kernel = self.compile( 2025-05-07T20:33:43.1133471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:43.1134748Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:43.1135529Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.1135990Z 2025-05-07T20:33:43.1136365Z self = 2025-05-07T20:33:43.1138480Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:43.1141580Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06b4c4c0>} 2025-05-07T20:33:43.1144277Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:43.1146298Z context = 2025-05-07T20:33:43.1146867Z 2025-05-07T20:33:43.1147200Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:43.1148224Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:43.1149136Z module_map=module_map) 2025-05-07T20:33:43.1149834Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:43.1150519Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:43.1151022Z E ^ 2025-05-07T20:33:43.1151929Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:43.1152820Z 2025-05-07T20:33:43.1153759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:43.1154921Z 2025-05-07T20:33:43.1155230Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.1156038Z self=, 2025-05-07T20:33:43.1167061Z T=128, 2025-05-07T20:33:43.1167479Z D=5120, 2025-05-07T20:33:43.1167838Z scale_ub=None, 2025-05-07T20:33:43.1168256Z contiguous=True, 2025-05-07T20:33:43.1168688Z compiled=False, 2025-05-07T20:33:43.1169077Z ) 2025-05-07T20:33:43.2006074Z self = 2025-05-07T20:33:43.2007089Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:43.2007606Z 2025-05-07T20:33:43.2007746Z @given( 2025-05-07T20:33:43.2008167Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.2008728Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.2009304Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.2009938Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.2010578Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.2011132Z ) 2025-05-07T20:33:43.2011801Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.2012666Z def test_silu_mul_quant( 2025-05-07T20:33:43.2013116Z self, 2025-05-07T20:33:43.2013487Z T: int, 2025-05-07T20:33:43.2013860Z D: int, 2025-05-07T20:33:43.2014261Z scale_ub: Optional[float], 2025-05-07T20:33:43.2014774Z contiguous: bool, 2025-05-07T20:33:43.2015224Z compiled: bool, 2025-05-07T20:33:43.2015641Z ) -> None: 2025-05-07T20:33:43.2016049Z torch.manual_seed(2025) 2025-05-07T20:33:43.2016515Z 2025-05-07T20:33:43.2017030Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.2017700Z 2025-05-07T20:33:43.2018071Z x_sign = torch.sign(x) 2025-05-07T20:33:43.2018621Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:43.2019224Z x = x_sign * x_clamp 2025-05-07T20:33:43.2019683Z x0 = x[:, :D] 2025-05-07T20:33:43.2020098Z x1 = x[:, D:] 2025-05-07T20:33:43.2020494Z 2025-05-07T20:33:43.2020854Z if contiguous: 2025-05-07T20:33:43.2021301Z x0 = x0.contiguous() 2025-05-07T20:33:43.2021795Z x1 = x1.contiguous() 2025-05-07T20:33:43.2022263Z 2025-05-07T20:33:43.2022635Z if scale_ub is not None: 2025-05-07T20:33:43.2023162Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:43.2024060Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:43.2025110Z ) 2025-05-07T20:33:43.2025480Z else: 2025-05-07T20:33:43.2026022Z scale_ub_tensor = None 2025-05-07T20:33:43.2026521Z 2025-05-07T20:33:43.2026959Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:43.2027556Z op = silu_mul_quant 2025-05-07T20:33:43.2028022Z if compiled: 2025-05-07T20:33:43.2028482Z op = torch.compile(op) 2025-05-07T20:33:43.2029046Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.2029559Z 2025-05-07T20:33:43.2029931Z > y_fp8, y_scale = fn() 2025-05-07T20:33:43.2030261Z 2025-05-07T20:33:43.2030448Z moe/activation_test.py:117: 2025-05-07T20:33:43.2031012Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.2031660Z moe/activation_test.py:115: in fn 2025-05-07T20:33:43.2032209Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.2033653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:43.2035054Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:43.2036109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:43.2037468Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:43.2039048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:43.2040052Z kernel = self.compile( 2025-05-07T20:33:43.2041090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:43.2042371Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:43.2043127Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.2043566Z 2025-05-07T20:33:43.2043934Z self = 2025-05-07T20:33:43.2046044Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:43.2048771Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06b4c940>} 2025-05-07T20:33:43.2051408Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:43.2053401Z context = 2025-05-07T20:33:43.2053970Z 2025-05-07T20:33:43.2054280Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:43.2055293Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:43.2056205Z module_map=module_map) 2025-05-07T20:33:43.2056886Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:43.2057537Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:43.2058039Z E ^ 2025-05-07T20:33:43.2058932Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:43.2059828Z 2025-05-07T20:33:43.2060643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:43.2061655Z 2025-05-07T20:33:43.2061850Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.2062637Z self=, 2025-05-07T20:33:43.2063406Z T=128, 2025-05-07T20:33:43.2063767Z D=7168, 2025-05-07T20:33:43.2064127Z scale_ub=None, 2025-05-07T20:33:43.2064640Z contiguous=True, 2025-05-07T20:33:43.2065064Z compiled=False, 2025-05-07T20:33:43.2065527Z ) 2025-05-07T20:33:43.2066126Z self = 2025-05-07T20:33:43.2067062Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:43.2067581Z 2025-05-07T20:33:43.2067738Z @given( 2025-05-07T20:33:43.2068178Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.2068760Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.2069342Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.2069980Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.2070608Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.2071144Z ) 2025-05-07T20:33:43.2071810Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.2072651Z def test_silu_mul_quant( 2025-05-07T20:33:43.2073121Z self, 2025-05-07T20:33:43.2073488Z T: int, 2025-05-07T20:33:43.2073962Z D: int, 2025-05-07T20:33:43.2074374Z scale_ub: Optional[float], 2025-05-07T20:33:43.2074889Z contiguous: bool, 2025-05-07T20:33:43.2075334Z compiled: bool, 2025-05-07T20:33:43.2075753Z ) -> None: 2025-05-07T20:33:43.2076252Z torch.manual_seed(2025) 2025-05-07T20:33:43.2076768Z 2025-05-07T20:33:43.2077280Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.2077942Z 2025-05-07T20:33:43.2078311Z x_sign = torch.sign(x) 2025-05-07T20:33:43.2078845Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:43.2079441Z x = x_sign * x_clamp 2025-05-07T20:33:43.2079903Z x0 = x[:, :D] 2025-05-07T20:33:43.2080304Z x1 = x[:, D:] 2025-05-07T20:33:43.2080704Z 2025-05-07T20:33:43.2081065Z if contiguous: 2025-05-07T20:33:43.2081517Z x0 = x0.contiguous() 2025-05-07T20:33:43.2082036Z x1 = x1.contiguous() 2025-05-07T20:33:43.2082500Z 2025-05-07T20:33:43.2082862Z if scale_ub is not None: 2025-05-07T20:33:43.2083387Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:43.2084016Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:43.2084605Z ) 2025-05-07T20:33:43.2084973Z else: 2025-05-07T20:33:43.2085378Z scale_ub_tensor = None 2025-05-07T20:33:43.2085857Z 2025-05-07T20:33:43.2086302Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:43.2086913Z op = silu_mul_quant 2025-05-07T20:33:43.2087395Z if compiled: 2025-05-07T20:33:43.2087866Z op = torch.compile(op) 2025-05-07T20:33:43.2088430Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.2088961Z 2025-05-07T20:33:43.2089318Z > y_fp8, y_scale = fn() 2025-05-07T20:33:43.2089640Z 2025-05-07T20:33:43.2089826Z moe/activation_test.py:117: 2025-05-07T20:33:43.2090426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.2091124Z moe/activation_test.py:115: in fn 2025-05-07T20:33:43.2091674Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.2093042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:43.2094427Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:43.2095449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:43.2096780Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:43.2098106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:43.2099167Z kernel = self.compile( 2025-05-07T20:33:43.2100178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:43.2101668Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:43.2102456Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.2102904Z 2025-05-07T20:33:43.2103298Z self = 2025-05-07T20:33:43.2105462Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:43.2108243Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06b4d240>} 2025-05-07T20:33:43.2110967Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:43.2113030Z context = 2025-05-07T20:33:43.2113681Z 2025-05-07T20:33:43.2114005Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:43.2115184Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:43.2116122Z module_map=module_map) 2025-05-07T20:33:43.2116820Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:43.2117503Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:43.2118009Z E ^ 2025-05-07T20:33:43.2118932Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:43.2119838Z 2025-05-07T20:33:43.2120694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:43.2121725Z 2025-05-07T20:33:43.2121928Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.2122732Z self=, 2025-05-07T20:33:43.2123491Z T=2048, 2025-05-07T20:33:43.2124585Z D=7168, 2025-05-07T20:33:43.2124973Z scale_ub=1200.0, 2025-05-07T20:33:43.2125393Z contiguous=True, 2025-05-07T20:33:43.2125812Z compiled=False, 2025-05-07T20:33:43.2126199Z ) 2025-05-07T20:33:43.3166256Z self = 2025-05-07T20:33:43.3167278Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:43.3167802Z 2025-05-07T20:33:43.3167955Z @given( 2025-05-07T20:33:43.3168348Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.3168888Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.3169445Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.3170077Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.3170735Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.3171312Z ) 2025-05-07T20:33:43.3171965Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.3172825Z def test_silu_mul_quant( 2025-05-07T20:33:43.3173317Z self, 2025-05-07T20:33:43.3173724Z T: int, 2025-05-07T20:33:43.3174106Z D: int, 2025-05-07T20:33:43.3174527Z scale_ub: Optional[float], 2025-05-07T20:33:43.3175042Z contiguous: bool, 2025-05-07T20:33:43.3175510Z compiled: bool, 2025-05-07T20:33:43.3175939Z ) -> None: 2025-05-07T20:33:43.3176342Z torch.manual_seed(2025) 2025-05-07T20:33:43.3176814Z 2025-05-07T20:33:43.3177331Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.3181775Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
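[annotation] For reference, the operation under test composes SiLU, an elementwise product, and an fp8 quantization. The log does not show the kernel's scaling scheme, so the following is only a sketch of the presumed math, assuming row-wise dynamic float8_e4m3fn quantization (the dtype matching Triton's fp8e4nv) with an optional scale upper bound; silu_mul_quant_ref and eps are illustrative names, not the FBGEMM implementation:

```python
import torch

def silu_mul_quant_ref(x0, x1, scale_ub=None, eps=1e-12):
    # Sketch: y = SiLU(x0) * x1 in fp32, then scale each row into the
    # representable range of float8_e4m3fn (max magnitude 448.0).
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=eps)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    y_scale = row_max / fp8_max
    y_fp8 = (y / y_scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, y_scale
```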
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.3185763Z 2025-05-07T20:33:43.3186009Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.3186426Z 2025-05-07T20:33:43.3186628Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.3187445Z self=, 2025-05-07T20:33:43.3188232Z T=1, 2025-05-07T20:33:43.3188581Z D=5120, 2025-05-07T20:33:43.3188953Z scale_ub=1200.0, 2025-05-07T20:33:43.3189371Z contiguous=True, 2025-05-07T20:33:43.3189794Z compiled=False, 2025-05-07T20:33:43.3190188Z ) 2025-05-07T20:33:43.3190808Z self = 2025-05-07T20:33:43.3191765Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:43.3192295Z 2025-05-07T20:33:43.3192445Z @given( 2025-05-07T20:33:43.3192895Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.3193924Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.3194523Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.3195182Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.3195845Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.3196410Z ) 2025-05-07T20:33:43.3197103Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.3197999Z def test_silu_mul_quant( 2025-05-07T20:33:43.3198474Z self, 2025-05-07T20:33:43.3198842Z T: int, 2025-05-07T20:33:43.3199229Z D: int, 2025-05-07T20:33:43.3199654Z scale_ub: Optional[float], 2025-05-07T20:33:43.3200170Z contiguous: bool, 2025-05-07T20:33:43.3200629Z compiled: bool, 2025-05-07T20:33:43.3201049Z ) -> None: 2025-05-07T20:33:43.3201448Z torch.manual_seed(2025) 2025-05-07T20:33:43.3201908Z 2025-05-07T20:33:43.3202433Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.3203106Z 2025-05-07T20:33:43.3203481Z x_sign = torch.sign(x) 2025-05-07T20:33:43.3204011Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:43.3204592Z x = x_sign * x_clamp 2025-05-07T20:33:43.3205057Z x0 = x[:, :D] 2025-05-07T20:33:43.3205470Z x1 = x[:, D:] 2025-05-07T20:33:43.3205861Z 2025-05-07T20:33:43.3206222Z if contiguous: 2025-05-07T20:33:43.3206666Z x0 = x0.contiguous() 2025-05-07T20:33:43.3207169Z x1 = x1.contiguous() 2025-05-07T20:33:43.3207628Z 2025-05-07T20:33:43.3208009Z if scale_ub is not None: 2025-05-07T20:33:43.3208532Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:43.3209172Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:43.3209779Z ) 2025-05-07T20:33:43.3210157Z else: 2025-05-07T20:33:43.3210572Z scale_ub_tensor = None 2025-05-07T20:33:43.3211100Z 2025-05-07T20:33:43.3211545Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:43.3212132Z op = silu_mul_quant 2025-05-07T20:33:43.3212615Z if compiled: 2025-05-07T20:33:43.3213093Z op = torch.compile(op) 2025-05-07T20:33:43.3213652Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.3214193Z 2025-05-07T20:33:43.3214561Z > y_fp8, y_scale = fn() 2025-05-07T20:33:43.3214878Z 2025-05-07T20:33:43.3215078Z moe/activation_test.py:117: 2025-05-07T20:33:43.3215638Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.3216407Z moe/activation_test.py:115: in fn 2025-05-07T20:33:43.3217030Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.3218398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:43.3219750Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:43.3220866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:43.3222218Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:43.3223526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:43.3224924Z kernel = self.compile( 2025-05-07T20:33:43.3225985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:43.3227271Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:43.3228058Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.3228519Z 2025-05-07T20:33:43.3228909Z self = 2025-05-07T20:33:43.3231193Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:43.3234122Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06b4e200>} 2025-05-07T20:33:43.3236842Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:43.3238865Z context = 2025-05-07T20:33:43.3239445Z 2025-05-07T20:33:43.3239783Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:43.3240797Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:43.3241716Z module_map=module_map) 2025-05-07T20:33:43.3242431Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:43.3243103Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:43.3243600Z E ^ 2025-05-07T20:33:43.3244517Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:43.3245420Z 2025-05-07T20:33:43.3246255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:43.3247287Z 2025-05-07T20:33:43.3247496Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.3248310Z self=, 2025-05-07T20:33:43.3249111Z T=2048, 2025-05-07T20:33:43.3249476Z D=5120, 2025-05-07T20:33:43.3249845Z scale_ub=None, 2025-05-07T20:33:43.3250260Z contiguous=True, 2025-05-07T20:33:43.3250704Z compiled=False, 2025-05-07T20:33:43.3251120Z ) 2025-05-07T20:33:43.3251748Z self = 2025-05-07T20:33:43.3252714Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:43.3253246Z 2025-05-07T20:33:43.3253407Z @given( 2025-05-07T20:33:43.3253837Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.3254444Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.3255043Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.3255665Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.3256302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.3257018Z ) 2025-05-07T20:33:43.3257786Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.3258677Z def test_silu_mul_quant( 2025-05-07T20:33:43.3259150Z self, 2025-05-07T20:33:43.3259511Z T: int, 2025-05-07T20:33:43.3259894Z D: int, 2025-05-07T20:33:43.3260321Z scale_ub: Optional[float], 2025-05-07T20:33:43.3260833Z contiguous: bool, 2025-05-07T20:33:43.3261283Z compiled: bool, 2025-05-07T20:33:43.3261697Z ) -> None: 2025-05-07T20:33:43.3262092Z torch.manual_seed(2025) 2025-05-07T20:33:43.3262558Z 2025-05-07T20:33:43.3263077Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.3263746Z 2025-05-07T20:33:43.3264106Z > x_sign = torch.sign(x) 2025-05-07T20:33:43.3268145Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.3272088Z 2025-05-07T20:33:43.3272322Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:43.3272743Z 2025-05-07T20:33:43.3272954Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.3273831Z self=, 2025-05-07T20:33:43.3274637Z T=16384, 2025-05-07T20:33:43.3275014Z D=5120, 2025-05-07T20:33:43.3275386Z scale_ub=None, 2025-05-07T20:33:43.3275790Z contiguous=True, 2025-05-07T20:33:43.3276224Z compiled=False, 2025-05-07T20:33:43.3276624Z ) 2025-05-07T20:33:43.4325536Z self = 2025-05-07T20:33:43.4326587Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:43.4327110Z 2025-05-07T20:33:43.4327260Z @given( 2025-05-07T20:33:43.4327681Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.4328265Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.4328833Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.4329464Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.4330084Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.4330672Z ) 2025-05-07T20:33:43.4331359Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.4332223Z def test_silu_mul_quant( 2025-05-07T20:33:43.4332701Z self, 2025-05-07T20:33:43.4333060Z T: int, 2025-05-07T20:33:43.4333438Z D: int, 2025-05-07T20:33:43.4333852Z scale_ub: Optional[float], 2025-05-07T20:33:43.4334376Z contiguous: bool, 2025-05-07T20:33:43.4334841Z compiled: bool, 2025-05-07T20:33:43.4335265Z ) -> None: 2025-05-07T20:33:43.4335672Z torch.manual_seed(2025) 2025-05-07T20:33:43.4336130Z 2025-05-07T20:33:43.4336642Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.4340781Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
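[annotation] The "allocated by PyTorch" figure stays pinned near 21.7 GiB across consecutive examples, so even small requests (40-80 MiB) fail once the device is saturated. A hedged mitigation sketch that frees cached CUDA memory between Hypothesis examples; _free_cuda is a hypothetical helper, and explicit empty_cache calls trade test speed for headroom:

```python
import gc

import torch

def _free_cuda() -> None:
    # Drop dangling Python references, then return cached allocator blocks
    # to the driver so the next example starts from a cleaner state.
    gc.collect()
    torch.cuda.empty_cache()

# Usage sketch: call _free_cuda() from the test case's tearDown(), or at the
# top of test_silu_mul_quant before the torch.randn allocation.
```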
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.4344635Z 2025-05-07T20:33:43.4345162Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.4345602Z 2025-05-07T20:33:43.4345919Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.4346724Z self=, 2025-05-07T20:33:43.4347515Z T=4096, 2025-05-07T20:33:43.4347865Z D=5120, 2025-05-07T20:33:43.4348240Z scale_ub=None, 2025-05-07T20:33:43.4348650Z contiguous=True, 2025-05-07T20:33:43.4349072Z compiled=False, 2025-05-07T20:33:43.4349470Z ) 2025-05-07T20:33:43.4350089Z self = 2025-05-07T20:33:43.4351045Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:43.4351601Z 2025-05-07T20:33:43.4351751Z @given( 2025-05-07T20:33:43.4352188Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.4352801Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.4353391Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.4354202Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.4354858Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.4355410Z ) 2025-05-07T20:33:43.4356101Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.4357124Z def test_silu_mul_quant( 2025-05-07T20:33:43.4357584Z self, 2025-05-07T20:33:43.4358092Z T: int, 2025-05-07T20:33:43.4358475Z D: int, 2025-05-07T20:33:43.4358863Z scale_ub: Optional[float], 2025-05-07T20:33:43.4359366Z contiguous: bool, 2025-05-07T20:33:43.4359811Z compiled: bool, 2025-05-07T20:33:43.4360222Z ) -> None: 2025-05-07T20:33:43.4360659Z torch.manual_seed(2025) 2025-05-07T20:33:43.4361143Z 2025-05-07T20:33:43.4361651Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.4365699Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.4369369Z 2025-05-07T20:33:43.4369604Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.4370029Z 2025-05-07T20:33:43.4370220Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.4382517Z self=, 2025-05-07T20:33:43.4383330Z T=2048, 2025-05-07T20:33:43.4383696Z D=5120, 2025-05-07T20:33:43.4384044Z scale_ub=None, 2025-05-07T20:33:43.4384456Z contiguous=False, 2025-05-07T20:33:43.4384904Z compiled=False, 2025-05-07T20:33:43.4385290Z ) 2025-05-07T20:33:43.4385905Z self = 2025-05-07T20:33:43.4386881Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:43.4387418Z 2025-05-07T20:33:43.4387565Z @given( 2025-05-07T20:33:43.4388011Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.4388619Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.4389214Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.4389850Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.4390505Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.4391084Z ) 2025-05-07T20:33:43.4391742Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.4392601Z def test_silu_mul_quant( 2025-05-07T20:33:43.4393071Z self, 2025-05-07T20:33:43.4393432Z T: int, 2025-05-07T20:33:43.4394034Z D: int, 2025-05-07T20:33:43.4394448Z scale_ub: Optional[float], 2025-05-07T20:33:43.4395033Z contiguous: bool, 2025-05-07T20:33:43.4395496Z compiled: bool, 2025-05-07T20:33:43.4395924Z ) -> None: 2025-05-07T20:33:43.4396322Z torch.manual_seed(2025) 2025-05-07T20:33:43.4396788Z 2025-05-07T20:33:43.4397310Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.4401385Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
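[annotation] The @given strategies above all use st.sampled_from over fixed lists, so the search space that verbose mode cycles through is a small Cartesian product:

```python
# 5 values of T, 2 of D, and 2 each for scale_ub / contiguous / compiled:
n_cases = 5 * 2 * 2 * 2 * 2
print(n_cases)  # 80 distinct examples, capped by _MAX_SAMPLES in @settings
```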
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.4405121Z 2025-05-07T20:33:43.4405377Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.4405798Z 2025-05-07T20:33:43.4405999Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.4406803Z self=, 2025-05-07T20:33:43.4407703Z T=4096, 2025-05-07T20:33:43.4408058Z D=7168, 2025-05-07T20:33:43.4408500Z scale_ub=None, 2025-05-07T20:33:43.4408925Z contiguous=True, 2025-05-07T20:33:43.4409344Z compiled=True, 2025-05-07T20:33:43.4409743Z ) 2025-05-07T20:33:43.4410355Z self = 2025-05-07T20:33:43.4411377Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:43.4411905Z 2025-05-07T20:33:43.4412055Z @given( 2025-05-07T20:33:43.4412502Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.4413100Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.4413695Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.4414332Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.4414949Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.4415492Z ) 2025-05-07T20:33:43.4416157Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.4417038Z def test_silu_mul_quant( 2025-05-07T20:33:43.4417508Z self, 2025-05-07T20:33:43.4417878Z T: int, 2025-05-07T20:33:43.4418253Z D: int, 2025-05-07T20:33:43.4418672Z scale_ub: Optional[float], 2025-05-07T20:33:43.4419179Z contiguous: bool, 2025-05-07T20:33:43.4419632Z compiled: bool, 2025-05-07T20:33:43.4420027Z ) -> None: 2025-05-07T20:33:43.4420436Z torch.manual_seed(2025) 2025-05-07T20:33:43.4420946Z 2025-05-07T20:33:43.4421454Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.4425836Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.4429654Z 2025-05-07T20:33:43.4429885Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.4430304Z 2025-05-07T20:33:43.4430507Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.4431305Z self=, 2025-05-07T20:33:43.4432095Z T=2048, 2025-05-07T20:33:43.4432457Z D=5120, 2025-05-07T20:33:43.4432830Z scale_ub=1200.0, 2025-05-07T20:33:43.4433422Z contiguous=False, 2025-05-07T20:33:43.4433940Z compiled=False, 2025-05-07T20:33:43.4434435Z ) 2025-05-07T20:33:43.4435046Z self = 2025-05-07T20:33:43.4436015Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:43.4436561Z 2025-05-07T20:33:43.4436715Z @given( 2025-05-07T20:33:43.4437145Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.4437748Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.4438340Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.4438971Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.4439609Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.4440167Z ) 2025-05-07T20:33:43.4440850Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.4441706Z def test_silu_mul_quant( 2025-05-07T20:33:43.4442180Z self, 2025-05-07T20:33:43.4442548Z T: int, 2025-05-07T20:33:43.4442919Z D: int, 2025-05-07T20:33:43.4443332Z scale_ub: Optional[float], 2025-05-07T20:33:43.4443826Z contiguous: bool, 2025-05-07T20:33:43.4444274Z compiled: bool, 2025-05-07T20:33:43.4444695Z ) -> None: 2025-05-07T20:33:43.4445231Z torch.manual_seed(2025) 2025-05-07T20:33:43.4445775Z 2025-05-07T20:33:43.4446290Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.4450300Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.4454193Z 2025-05-07T20:33:43.4454423Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.4454843Z 2025-05-07T20:33:43.4455051Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.4455857Z self=, 2025-05-07T20:33:43.4456655Z T=4096, 2025-05-07T20:33:43.4457016Z D=7168, 2025-05-07T20:33:43.4457376Z scale_ub=1200.0, 2025-05-07T20:33:43.4457807Z contiguous=True, 2025-05-07T20:33:43.4458235Z compiled=False, 2025-05-07T20:33:43.4458620Z ) 2025-05-07T20:33:43.5824563Z self = 2025-05-07T20:33:43.5825594Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:43.5826122Z 2025-05-07T20:33:43.5826264Z @given( 2025-05-07T20:33:43.5826705Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.5827304Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.5827885Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.5828493Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.5829067Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.5829563Z ) 2025-05-07T20:33:43.5830178Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.5831031Z def test_silu_mul_quant( 2025-05-07T20:33:43.5831448Z self, 2025-05-07T20:33:43.5831801Z T: int, 2025-05-07T20:33:43.5832173Z D: int, 2025-05-07T20:33:43.5832565Z scale_ub: Optional[float], 2025-05-07T20:33:43.5833087Z contiguous: bool, 2025-05-07T20:33:43.5833646Z compiled: bool, 2025-05-07T20:33:43.5834072Z ) -> None: 2025-05-07T20:33:43.5834482Z torch.manual_seed(2025) 2025-05-07T20:33:43.5834939Z 2025-05-07T20:33:43.5835443Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.5839979Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.5843859Z 2025-05-07T20:33:43.5844089Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.5844507Z 2025-05-07T20:33:43.5844704Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.5845501Z self=, 2025-05-07T20:33:43.5846275Z T=16384, 2025-05-07T20:33:43.5846651Z D=7168, 2025-05-07T20:33:43.5847018Z scale_ub=None, 2025-05-07T20:33:43.5847424Z contiguous=False, 2025-05-07T20:33:43.5847859Z compiled=True, 2025-05-07T20:33:43.5848254Z ) 2025-05-07T20:33:43.5848858Z self = 2025-05-07T20:33:43.5849973Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:43.5850641Z 2025-05-07T20:33:43.5850801Z @given( 2025-05-07T20:33:43.5851248Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.5851846Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.5852432Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.5853075Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.5853704Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.5854259Z ) 2025-05-07T20:33:43.5854935Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.5855808Z def test_silu_mul_quant( 2025-05-07T20:33:43.5856265Z self, 2025-05-07T20:33:43.5856640Z T: int, 2025-05-07T20:33:43.5857016Z D: int, 2025-05-07T20:33:43.5857415Z scale_ub: Optional[float], 2025-05-07T20:33:43.5857936Z contiguous: bool, 2025-05-07T20:33:43.5858402Z compiled: bool, 2025-05-07T20:33:43.5858815Z ) -> None: 2025-05-07T20:33:43.5859210Z torch.manual_seed(2025) 2025-05-07T20:33:43.5859620Z 2025-05-07T20:33:43.5860116Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.5864203Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.5867899Z 2025-05-07T20:33:43.5868135Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.5868560Z 2025-05-07T20:33:43.5868769Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.5869571Z self=, 2025-05-07T20:33:43.5870338Z T=4096, 2025-05-07T20:33:43.5870706Z D=7168, 2025-05-07T20:33:43.5871070Z scale_ub=None, 2025-05-07T20:33:43.5871476Z contiguous=True, 2025-05-07T20:33:43.5871908Z compiled=False, 2025-05-07T20:33:43.5872301Z ) 2025-05-07T20:33:43.5872893Z self = 2025-05-07T20:33:43.5873991Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:43.5874515Z 2025-05-07T20:33:43.5874791Z @given( 2025-05-07T20:33:43.5875223Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.5876835Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.5877459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.5878100Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.5878746Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.5879329Z ) 2025-05-07T20:33:43.5880020Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.5880831Z def test_silu_mul_quant( 2025-05-07T20:33:43.5881266Z self, 2025-05-07T20:33:43.5881629Z T: int, 2025-05-07T20:33:43.5881989Z D: int, 2025-05-07T20:33:43.5882400Z scale_ub: Optional[float], 2025-05-07T20:33:43.5882928Z contiguous: bool, 2025-05-07T20:33:43.5883383Z compiled: bool, 2025-05-07T20:33:43.5883818Z ) -> None: 2025-05-07T20:33:43.5884237Z torch.manual_seed(2025) 2025-05-07T20:33:43.5884702Z 2025-05-07T20:33:43.5885231Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.5889419Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
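[annotation] A defensive variant of the test could also pre-check free device memory before materializing the inputs. A sketch under the assumption that the body needs a few multiples of the base [T, 2*D] bf16 tensor (x, x_sign, x_clamp, their product, and the slices); fits_on_gpu and slack are illustrative names:

```python
import torch

def fits_on_gpu(T: int, D: int, slack: float = 6.0) -> bool:
    # torch.cuda.mem_get_info() returns (free_bytes, total_bytes) for the
    # current device; require `slack` copies of the [T, 2*D] bf16 tensor.
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes > slack * T * (2 * D) * 2

# Usage sketch: hypothesis.assume(fits_on_gpu(T, D)) at the top of the test.
```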
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.5893632Z 2025-05-07T20:33:43.5893873Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.5894287Z 2025-05-07T20:33:43.5894497Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.5895283Z self=, 2025-05-07T20:33:43.5896073Z T=16384, 2025-05-07T20:33:43.5896454Z D=7168, 2025-05-07T20:33:43.5896826Z scale_ub=None, 2025-05-07T20:33:43.5897248Z contiguous=True, 2025-05-07T20:33:43.5897681Z compiled=False, 2025-05-07T20:33:43.5898070Z ) 2025-05-07T20:33:43.5898676Z self = 2025-05-07T20:33:43.5899657Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:43.5900190Z 2025-05-07T20:33:43.5900353Z @given( 2025-05-07T20:33:43.5900782Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.5901389Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.5901980Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.5902607Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.5903245Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.5903794Z ) 2025-05-07T20:33:43.5904452Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.5905311Z def test_silu_mul_quant( 2025-05-07T20:33:43.5905785Z self, 2025-05-07T20:33:43.5906161Z T: int, 2025-05-07T20:33:43.5906536Z D: int, 2025-05-07T20:33:43.5906959Z scale_ub: Optional[float], 2025-05-07T20:33:43.5907491Z contiguous: bool, 2025-05-07T20:33:43.5907955Z compiled: bool, 2025-05-07T20:33:43.5908392Z ) -> None: 2025-05-07T20:33:43.5908804Z torch.manual_seed(2025) 2025-05-07T20:33:43.5909266Z 2025-05-07T20:33:43.5909797Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.5914065Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.5917802Z 2025-05-07T20:33:43.5918055Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.5918468Z 2025-05-07T20:33:43.5918687Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.5919477Z self=, 2025-05-07T20:33:43.5920263Z T=16384, 2025-05-07T20:33:43.5920641Z D=7168, 2025-05-07T20:33:43.5921037Z scale_ub=1200.0, 2025-05-07T20:33:43.5921490Z contiguous=True, 2025-05-07T20:33:43.5921919Z compiled=False, 2025-05-07T20:33:43.5922308Z ) 2025-05-07T20:33:43.5922917Z self = 2025-05-07T20:33:43.5924115Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:43.5924671Z 2025-05-07T20:33:43.5924823Z @given( 2025-05-07T20:33:43.5925269Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.5925878Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.5926470Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.5927261Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.5927970Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.5928497Z ) 2025-05-07T20:33:43.5929151Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.5930007Z def test_silu_mul_quant( 2025-05-07T20:33:43.5930483Z self, 2025-05-07T20:33:43.5930850Z T: int, 2025-05-07T20:33:43.5931230Z D: int, 2025-05-07T20:33:43.5931653Z scale_ub: Optional[float], 2025-05-07T20:33:43.5932168Z contiguous: bool, 2025-05-07T20:33:43.5932639Z compiled: bool, 2025-05-07T20:33:43.5933071Z ) -> None: 2025-05-07T20:33:43.5933476Z torch.manual_seed(2025) 2025-05-07T20:33:43.5933956Z 2025-05-07T20:33:43.5934474Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.5938501Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.5942190Z 2025-05-07T20:33:43.5942438Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.5942852Z 2025-05-07T20:33:43.5943056Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.5943863Z self=, 2025-05-07T20:33:43.5944648Z T=128, 2025-05-07T20:33:43.5945002Z D=5120, 2025-05-07T20:33:43.5945372Z scale_ub=1200.0, 2025-05-07T20:33:43.5945806Z contiguous=False, 2025-05-07T20:33:43.5946225Z compiled=False, 2025-05-07T20:33:43.5946632Z ) 2025-05-07T20:33:43.7483813Z self = 2025-05-07T20:33:43.7484900Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:43.7485457Z 2025-05-07T20:33:43.7485633Z @given( 2025-05-07T20:33:43.7486105Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.7486740Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.7487366Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.7488033Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.7488699Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.7489506Z ) 2025-05-07T20:33:43.7490408Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.7491031Z def test_silu_mul_quant( 2025-05-07T20:33:43.7491325Z self, 2025-05-07T20:33:43.7491560Z T: int, 2025-05-07T20:33:43.7491797Z D: int, 2025-05-07T20:33:43.7492056Z scale_ub: Optional[float], 2025-05-07T20:33:43.7492378Z contiguous: bool, 2025-05-07T20:33:43.7492657Z compiled: bool, 2025-05-07T20:33:43.7492927Z ) -> None: 2025-05-07T20:33:43.7493184Z torch.manual_seed(2025) 2025-05-07T20:33:43.7493461Z 2025-05-07T20:33:43.7493780Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.7494176Z 2025-05-07T20:33:43.7494400Z x_sign = torch.sign(x) 2025-05-07T20:33:43.7494741Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:43.7495107Z x = x_sign * x_clamp 2025-05-07T20:33:43.7495390Z x0 = x[:, :D] 2025-05-07T20:33:43.7495651Z x1 = x[:, D:] 2025-05-07T20:33:43.7495908Z 2025-05-07T20:33:43.7496123Z if contiguous: 2025-05-07T20:33:43.7496405Z x0 = x0.contiguous() 2025-05-07T20:33:43.7496714Z x1 = x1.contiguous() 2025-05-07T20:33:43.7496995Z 2025-05-07T20:33:43.7497318Z if scale_ub is not None: 2025-05-07T20:33:43.7497717Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:43.7498111Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:43.7498466Z ) 2025-05-07T20:33:43.7498697Z else: 2025-05-07T20:33:43.7498953Z scale_ub_tensor = None 2025-05-07T20:33:43.7499243Z 2025-05-07T20:33:43.7499522Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:43.7499892Z op = silu_mul_quant 2025-05-07T20:33:43.7500184Z if compiled: 2025-05-07T20:33:43.7500538Z op = torch.compile(op) 2025-05-07T20:33:43.7500983Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.7501383Z 2025-05-07T20:33:43.7501681Z > y_fp8, y_scale = fn() 2025-05-07T20:33:43.7501919Z 2025-05-07T20:33:43.7502071Z moe/activation_test.py:117: 2025-05-07T20:33:43.7502457Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.7502851Z moe/activation_test.py:115: in fn 2025-05-07T20:33:43.7503188Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.7503985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:43.7504770Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:43.7505385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:43.7506169Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:43.7506927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:43.7507540Z kernel = self.compile( 2025-05-07T20:33:43.7508165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:43.7508916Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:43.7509371Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.7509642Z 2025-05-07T20:33:43.7509883Z self = 2025-05-07T20:33:43.7511115Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:43.7512686Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06c11ea0>} 2025-05-07T20:33:43.7514420Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:43.7515587Z context = 2025-05-07T20:33:43.7515928Z 2025-05-07T20:33:43.7516119Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:43.7516718Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:43.7517257Z module_map=module_map) 2025-05-07T20:33:43.7517674Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:43.7518085Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:43.7518397Z E ^ 2025-05-07T20:33:43.7518929Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:43.7519452Z 2025-05-07T20:33:43.7519932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:43.7520524Z 2025-05-07T20:33:43.7520651Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.7521225Z self=, 2025-05-07T20:33:43.7521684Z T=2048, 2025-05-07T20:33:43.7521910Z D=7168, 2025-05-07T20:33:43.7522139Z scale_ub=None, 2025-05-07T20:33:43.7522393Z contiguous=False, 2025-05-07T20:33:43.7522660Z compiled=False, 2025-05-07T20:33:43.7522904Z ) 2025-05-07T20:33:43.7523270Z self = 2025-05-07T20:33:43.7524002Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:43.7524317Z 2025-05-07T20:33:43.7524413Z @given( 2025-05-07T20:33:43.7524683Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.7525051Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.7525407Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.7525794Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.7526173Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.7526512Z ) 2025-05-07T20:33:43.7526924Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.7527425Z def test_silu_mul_quant( 2025-05-07T20:33:43.7527711Z self, 2025-05-07T20:33:43.7527942Z T: int, 2025-05-07T20:33:43.7528169Z D: int, 2025-05-07T20:33:43.7528425Z scale_ub: Optional[float], 2025-05-07T20:33:43.7528741Z contiguous: bool, 2025-05-07T20:33:43.7529021Z compiled: bool, 2025-05-07T20:33:43.7529280Z ) -> None: 2025-05-07T20:33:43.7529533Z torch.manual_seed(2025) 2025-05-07T20:33:43.7529821Z 2025-05-07T20:33:43.7530142Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.7532531Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
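[note] The allocator hint printed with each OOM suggests expandable segments. With ~21.7 GiB already held by PyTorch across earlier examples, the pool looks exhausted rather than fragmented, but the suggested setting is cheap to try; it must be set before the first CUDA allocation in the process. A minimal sketch:

    # Trying the allocator's own suggestion; set the variable before any CUDA
    # allocation happens (e.g. before the test module runs).
    import os
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch
    x = torch.randn(1024, device="cuda")  # allocations now use expandable segments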
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.7534640Z 2025-05-07T20:33:43.7534781Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:43.7535028Z 2025-05-07T20:33:43.7543037Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.7543550Z self=, 2025-05-07T20:33:43.7544127Z T=128, 2025-05-07T20:33:43.7544342Z D=7168, 2025-05-07T20:33:43.7544634Z scale_ub=1200.0, 2025-05-07T20:33:43.7544906Z contiguous=True, 2025-05-07T20:33:43.7545166Z compiled=True, 2025-05-07T20:33:43.7545413Z ) 2025-05-07T20:33:43.7989533Z self = 2025-05-07T20:33:43.7990958Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:43.7991386Z 2025-05-07T20:33:43.7991530Z @given( 2025-05-07T20:33:43.7991907Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.7992393Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.7992857Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.7993325Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.7993796Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.7994139Z ) 2025-05-07T20:33:43.7994549Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.7995071Z def test_silu_mul_quant( 2025-05-07T20:33:43.7995368Z self, 2025-05-07T20:33:43.7995601Z T: int, 2025-05-07T20:33:43.7995850Z D: int, 2025-05-07T20:33:43.7996119Z scale_ub: Optional[float], 2025-05-07T20:33:43.7996445Z contiguous: bool, 2025-05-07T20:33:43.7996863Z compiled: bool, 2025-05-07T20:33:43.7997209Z ) -> None: 2025-05-07T20:33:43.7997479Z torch.manual_seed(2025) 2025-05-07T20:33:43.7997765Z 2025-05-07T20:33:43.7998093Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.7998496Z 2025-05-07T20:33:43.7998727Z x_sign = torch.sign(x) 2025-05-07T20:33:43.7999078Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:43.7999449Z x = x_sign * x_clamp 2025-05-07T20:33:43.7999735Z x0 = x[:, :D] 2025-05-07T20:33:43.8000001Z x1 = x[:, D:] 2025-05-07T20:33:43.8000254Z 2025-05-07T20:33:43.8000480Z if contiguous: 2025-05-07T20:33:43.8000765Z x0 = x0.contiguous() 2025-05-07T20:33:43.8001082Z x1 = x1.contiguous() 2025-05-07T20:33:43.8001393Z 2025-05-07T20:33:43.8001631Z if scale_ub is not None: 2025-05-07T20:33:43.8001959Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:43.8002355Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:43.8002722Z ) 2025-05-07T20:33:43.8002954Z else: 2025-05-07T20:33:43.8003208Z scale_ub_tensor = None 2025-05-07T20:33:43.8003509Z 2025-05-07T20:33:43.8003785Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:43.8004149Z op = silu_mul_quant 2025-05-07T20:33:43.8004447Z if compiled: 2025-05-07T20:33:43.8004745Z op = torch.compile(op) 2025-05-07T20:33:43.8005094Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.8005418Z 2025-05-07T20:33:43.8005654Z > y_fp8, y_scale = fn() 2025-05-07T20:33:43.8005848Z 2025-05-07T20:33:43.8005970Z moe/activation_test.py:117: 2025-05-07T20:33:43.8006319Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.8006711Z moe/activation_test.py:115: in fn 2025-05-07T20:33:43.8007042Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:43.8007695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:43.8008347Z return fn(*args, **kwargs) 
2025-05-07T20:33:43.8009104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:43.8009885Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:43.8010504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:43.8011284Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:43.8012185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:43.8012795Z kernel = self.compile( 2025-05-07T20:33:43.8013423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:43.8014180Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:43.8014635Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:43.8014903Z 2025-05-07T20:33:43.8015141Z self = 2025-05-07T20:33:43.8016365Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:43.8017918Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06c137f0>} 2025-05-07T20:33:43.8019440Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:43.8020685Z context = 2025-05-07T20:33:43.8021023Z 2025-05-07T20:33:43.8021218Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:43.8021816Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:43.8022358Z module_map=module_map) 2025-05-07T20:33:43.8022775Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:43.8023189Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:43.8023500Z E ^ 2025-05-07T20:33:43.8024224Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:43.8024745Z 2025-05-07T20:33:43.8025220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:43.8025809Z 2025-05-07T20:33:43.8025936Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.8026426Z self=, 2025-05-07T20:33:43.8026884Z T=128, 2025-05-07T20:33:43.8027114Z D=7168, 2025-05-07T20:33:43.8027349Z scale_ub=1200.0, 2025-05-07T20:33:43.8027610Z contiguous=True, 2025-05-07T20:33:43.8027877Z compiled=False, 2025-05-07T20:33:43.8028126Z ) 2025-05-07T20:33:43.8028494Z self = 2025-05-07T20:33:43.8029068Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:43.8029382Z 2025-05-07T20:33:43.8029483Z @given( 2025-05-07T20:33:43.8029756Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.8030126Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.8030493Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.8030878Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.8031264Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.8031602Z ) 2025-05-07T20:33:43.8032011Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.8032518Z def test_silu_mul_quant( 2025-05-07T20:33:43.8032804Z self, 2025-05-07T20:33:43.8033045Z T: int, 2025-05-07T20:33:43.8033279Z D: int, 2025-05-07T20:33:43.8033611Z scale_ub: Optional[float], 2025-05-07T20:33:43.8033933Z contiguous: bool, 2025-05-07T20:33:43.8034214Z compiled: bool, 2025-05-07T20:33:43.8034484Z ) -> None: 2025-05-07T20:33:43.8034826Z torch.manual_seed(2025) 2025-05-07T20:33:43.8035109Z 2025-05-07T20:33:43.8035502Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.8035903Z 2025-05-07T20:33:43.8036137Z x_sign = torch.sign(x) 2025-05-07T20:33:43.8036475Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:43.8038748Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
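[note] The sign/clamp preamble in the test body bounds every input magnitude to [0.01, 2.0], presumably to keep the fp8 comparison away from underflow and overflow. A minimal illustration of the transform:

    import torch

    x = torch.randn(4, 8)
    x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
    # every nonzero magnitude now lies in [0.01, 2.0]; sign(0) == 0 keeps exact zeros
    assert ((x == 0) | ((x.abs() >= 0.01) & (x.abs() <= 2.0))).all()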
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.8040841Z 2025-05-07T20:33:43.8040983Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:43.8041231Z 2025-05-07T20:33:43.8041388Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.8041923Z self=, 2025-05-07T20:33:43.8042392Z T=128, 2025-05-07T20:33:43.8042620Z D=5120, 2025-05-07T20:33:43.8042927Z scale_ub=1200.0, 2025-05-07T20:33:43.8043192Z contiguous=True, 2025-05-07T20:33:43.8043522Z compiled=True, 2025-05-07T20:33:43.8043770Z ) 2025-05-07T20:33:43.8044140Z self = 2025-05-07T20:33:43.8044912Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:43.8045330Z 2025-05-07T20:33:43.8045472Z @given( 2025-05-07T20:33:43.8045757Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:43.8046126Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:43.8046487Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:43.8046871Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:43.8047264Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:43.8047604Z ) 2025-05-07T20:33:43.8048016Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:43.8048524Z def test_silu_mul_quant( 2025-05-07T20:33:43.8048820Z self, 2025-05-07T20:33:43.8049061Z T: int, 2025-05-07T20:33:43.8049297Z D: int, 2025-05-07T20:33:43.8049581Z scale_ub: Optional[float], 2025-05-07T20:33:43.8049926Z contiguous: bool, 2025-05-07T20:33:43.8050214Z compiled: bool, 2025-05-07T20:33:43.8050475Z ) -> None: 2025-05-07T20:33:43.8050734Z torch.manual_seed(2025) 2025-05-07T20:33:43.8051026Z 2025-05-07T20:33:43.8051343Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:43.8051745Z 2025-05-07T20:33:43.8051979Z x_sign = torch.sign(x) 2025-05-07T20:33:43.8052319Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:43.8054586Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
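[note] The compiled=True path wraps the op with torch.compile inside the fn() closure, so one code path covers both eager and compiled execution. A trimmed sketch of the pattern; silu_mul here is a hypothetical stand-in for silu_mul_quant:

    import torch

    def silu_mul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # hypothetical stand-in for the eager op under test
        return a * torch.sigmoid(a) * b

    op = torch.compile(silu_mul)  # drop-in wrapper with the same call signature
    y = op(torch.randn(2, 4), torch.randn(2, 4))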
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:43.8056899Z 2025-05-07T20:33:43.8057498Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:43.8057755Z 2025-05-07T20:33:43.8057881Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:43.8058360Z self=, 2025-05-07T20:33:43.8058977Z T=128, 2025-05-07T20:33:43.8059282Z D=7168, 2025-05-07T20:33:43.8059512Z scale_ub=None, 2025-05-07T20:33:43.8059807Z contiguous=True, 2025-05-07T20:33:43.8060076Z compiled=True, 2025-05-07T20:33:43.8060320Z ) 2025-05-07T20:33:44.0196431Z self = 2025-05-07T20:33:44.0197129Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:44.0197454Z 2025-05-07T20:33:44.0197547Z @given( 2025-05-07T20:33:44.0197820Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0198178Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0198537Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0198922Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0199310Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0199638Z ) 2025-05-07T20:33:44.0200045Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0200556Z def test_silu_mul_quant( 2025-05-07T20:33:44.0200834Z self, 2025-05-07T20:33:44.0201067Z T: int, 2025-05-07T20:33:44.0201319Z D: int, 2025-05-07T20:33:44.0201581Z scale_ub: Optional[float], 2025-05-07T20:33:44.0201905Z contiguous: bool, 2025-05-07T20:33:44.0202186Z compiled: bool, 2025-05-07T20:33:44.0202575Z ) -> None: 2025-05-07T20:33:44.0202891Z torch.manual_seed(2025) 2025-05-07T20:33:44.0203176Z 2025-05-07T20:33:44.0203487Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0205797Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
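[note] The contiguous parameter matters because x0 = x[:, :D] and x1 = x[:, D:] are strided views into the [T, 2*D] buffer (row stride 2*D), so neither is contiguous until .contiguous() materializes a copy:

    import torch

    x = torch.randn(2, 8)          # stands in for the [T, 2*D] input
    x0, x1 = x[:, :4], x[:, 4:]
    print(x0.is_contiguous(), x1.is_contiguous())   # False False -- both are views
    print(x0.contiguous().is_contiguous())          # True -- materialized copy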
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:44.0207884Z 2025-05-07T20:33:44.0208022Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:44.0208272Z 2025-05-07T20:33:44.0219220Z FAILED 2025-05-07T20:33:44.0219375Z 2025-05-07T20:33:44.0219590Z =================================== FAILURES =================================== 2025-05-07T20:33:44.0220295Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:44.0221048Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:44.0222033Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:33:44.0222876Z | yield 2025-05-07T20:33:44.0223549Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run 2025-05-07T20:33:44.0224592Z | self._callTestMethod(testMethod) 2025-05-07T20:33:44.0225479Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod 2025-05-07T20:33:44.0226361Z | method() 2025-05-07T20:33:44.0227490Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:44.0228649Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0229641Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:44.0230618Z | raise the_error_hypothesis_found 2025-05-07T20:33:44.0231377Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:44.0232182Z +-+---------------- 1 ---------------- 2025-05-07T20:33:44.0232646Z | Traceback (most recent call last): 2025-05-07T20:33:44.0234089Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:44.0235325Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0238531Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
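[note] On Python 3.10, Hypothesis reports the distinct falsifying failures as one exceptiongroup.ExceptionGroup (the backport of the 3.11 built-in), with one sub-exception per failure; that is the "4 sub-exceptions" structure above. A minimal illustration, assuming the exceptiongroup backport is installed:

    from exceptiongroup import ExceptionGroup  # 3.11 backport used on Python 3.10

    try:
        raise ExceptionGroup("demo", [ValueError("a"), TypeError("b")])
    except ExceptionGroup as eg:
        print(len(eg.exceptions))  # 2 sub-exceptions, like the 4 in this log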
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:44.0241626Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:44.0242319Z | self=, 2025-05-07T20:33:44.0242952Z | T=2048, 2025-05-07T20:33:44.0243340Z | D=5120, # or any other generated value 2025-05-07T20:33:44.0243874Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:44.0244452Z | contiguous=True, # or any other generated value 2025-05-07T20:33:44.0245130Z | compiled=False, # or any other generated value 2025-05-07T20:33:44.0245691Z | ) 2025-05-07T20:33:44.0245993Z | 2025-05-07T20:33:44.0246808Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:44.0247757Z +---------------- 2 ---------------- 2025-05-07T20:33:44.0248226Z | Traceback (most recent call last): 2025-05-07T20:33:44.0249359Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:44.0250567Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0253713Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:44.0256741Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:44.0257437Z | self=, 2025-05-07T20:33:44.0258074Z | T=128, 2025-05-07T20:33:44.0258396Z | D=7168, 2025-05-07T20:33:44.0258741Z | scale_ub=None, 2025-05-07T20:33:44.0259131Z | contiguous=True, 2025-05-07T20:33:44.0259516Z | compiled=True, 2025-05-07T20:33:44.0259884Z | ) 2025-05-07T20:33:44.0260177Z | 2025-05-07T20:33:44.0260993Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:44.0261938Z +---------------- 3 ---------------- 2025-05-07T20:33:44.0262410Z | Traceback (most recent call last): 2025-05-07T20:33:44.0263440Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:44.0264320Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0266694Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
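[note] Each failure comes with a ready-made replay decorator; applied locally, it forces Hypothesis to rerun exactly that example. A trimmed sketch using the blob from failure 1 above (the strategy list is cut down here for brevity):

    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')  # blob from the log above
    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    def test_repro(T: int) -> None:
        ...  # test body elided; remove the decorator once the bug is fixed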
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:44.0268946Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:44.0269455Z | self=, 2025-05-07T20:33:44.0269918Z | T=128, 2025-05-07T20:33:44.0270159Z | D=5120, 2025-05-07T20:33:44.0270412Z | scale_ub=1200.0, 2025-05-07T20:33:44.0270743Z | contiguous=True, 2025-05-07T20:33:44.0271142Z | compiled=True, 2025-05-07T20:33:44.0271517Z | ) 2025-05-07T20:33:44.0271820Z | 2025-05-07T20:33:44.0272675Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:44.0273812Z +---------------- 4 ---------------- 2025-05-07T20:33:44.0274308Z | Traceback (most recent call last): 2025-05-07T20:33:44.0275466Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:44.0276768Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:44.0277875Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:44.0278681Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0279628Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:44.0280749Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:44.0281710Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:44.0282836Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0283995Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:44.0285252Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0286545Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:33:44.0287856Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0289136Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:44.0290284Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:44.0291568Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:44.0292685Z | fn() 2025-05-07T20:33:44.0293676Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:44.0294664Z | self.fn.run( 2025-05-07T20:33:44.0295502Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:44.0296425Z | kernel = self.compile( 2025-05-07T20:33:44.0297397Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:44.0298520Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0299697Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:44.0301147Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0302128Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0302680Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:44.0303083Z | ^ 2025-05-07T20:33:44.0303780Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0304641Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:44.0305244Z | # The test always failed when commented parts were varied together. 2025-05-07T20:33:44.0306012Z | self=, 2025-05-07T20:33:44.0306687Z | T=1, # or any other generated value 2025-05-07T20:33:44.0307181Z | D=5120, # or any other generated value 2025-05-07T20:33:44.0307725Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:44.0308327Z | contiguous=True, # or any other generated value 2025-05-07T20:33:44.0308921Z | compiled=True, # or any other generated value 2025-05-07T20:33:44.0309413Z | ) 2025-05-07T20:33:44.0309787Z | 2025-05-07T20:33:44.0310669Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:44.0311637Z +------------------------------------ 2025-05-07T20:33:44.0312215Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:44.0312823Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0313491Z self=, 2025-05-07T20:33:44.0314292Z T=1, 2025-05-07T20:33:44.0314597Z D=5120, 2025-05-07T20:33:44.0314924Z scale_ub=None, 2025-05-07T20:33:44.0315281Z contiguous=True, 2025-05-07T20:33:44.0315655Z compiled=True, 2025-05-07T20:33:44.0316003Z ) 2025-05-07T20:33:44.0316524Z self = 2025-05-07T20:33:44.0317306Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:44.0317737Z 2025-05-07T20:33:44.0317868Z @given( 2025-05-07T20:33:44.0318255Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0318758Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0319262Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0319807Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0320337Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0320812Z ) 2025-05-07T20:33:44.0321386Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0322173Z def test_silu_mul_quant( 2025-05-07T20:33:44.0322559Z self, 2025-05-07T20:33:44.0322884Z T: int, 2025-05-07T20:33:44.0323206Z D: int, 2025-05-07T20:33:44.0348671Z scale_ub: Optional[float], 2025-05-07T20:33:44.0349208Z contiguous: bool, 2025-05-07T20:33:44.0349621Z compiled: bool, 2025-05-07T20:33:44.0350005Z ) -> None: 2025-05-07T20:33:44.0350366Z torch.manual_seed(2025) 2025-05-07T20:33:44.0350769Z 2025-05-07T20:33:44.0351221Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0351763Z 2025-05-07T20:33:44.0352090Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0352562Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0353053Z x = x_sign * x_clamp 2025-05-07T20:33:44.0353450Z x0 = x[:, :D] 2025-05-07T20:33:44.0353918Z x1 = x[:, D:] 2025-05-07T20:33:44.0354269Z 2025-05-07T20:33:44.0354588Z if contiguous: 2025-05-07T20:33:44.0354977Z x0 = x0.contiguous() 
2025-05-07T20:33:44.0355406Z x1 = x1.contiguous() 2025-05-07T20:33:44.0356031Z 2025-05-07T20:33:44.0356470Z if scale_ub is not None: 2025-05-07T20:33:44.0356923Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0357473Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0357973Z ) 2025-05-07T20:33:44.0358298Z else: 2025-05-07T20:33:44.0358636Z scale_ub_tensor = None 2025-05-07T20:33:44.0359051Z 2025-05-07T20:33:44.0359405Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0359889Z op = silu_mul_quant 2025-05-07T20:33:44.0360279Z if compiled: 2025-05-07T20:33:44.0360674Z op = torch.compile(op) 2025-05-07T20:33:44.0361147Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0361600Z 2025-05-07T20:33:44.0361923Z y_fp8, y_scale = fn() 2025-05-07T20:33:44.0362392Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:44.0362876Z 2025-05-07T20:33:44.0363277Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0363826Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:44.0364301Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:44.0364822Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:44.0365534Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0366147Z 2025-05-07T20:33:44.0366495Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:44.0366817Z 2025-05-07T20:33:44.0366998Z moe/activation_test.py:126: 2025-05-07T20:33:44.0367487Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0368036Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:44.0368587Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0369841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:44.0371042Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:44.0371928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0373014Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0374108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:44.0375238Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0376466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:44.0377691Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0378892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:44.0379933Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:44.0380898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:44.0381735Z fn() 2025-05-07T20:33:44.0382549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:44.0383499Z self.fn.run( 2025-05-07T20:33:44.0384254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0385128Z kernel = self.compile( 2025-05-07T20:33:44.0386015Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0387082Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0387731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0388188Z 2025-05-07T20:33:44.0388556Z self = 2025-05-07T20:33:44.0390215Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0392406Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae2b098af0>} 2025-05-07T20:33:44.0394669Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0396307Z context = 2025-05-07T20:33:44.0396777Z 2025-05-07T20:33:44.0397042Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0397899Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0398649Z module_map=module_map) 2025-05-07T20:33:44.0399223Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0399862Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:44.0400339Z E ^ 2025-05-07T20:33:44.0401102Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0401899Z 2025-05-07T20:33:44.0402573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0403407Z 2025-05-07T20:33:44.0403590Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0404268Z self=, 2025-05-07T20:33:44.0404928Z T=2048, 2025-05-07T20:33:44.0405252Z D=5120, 2025-05-07T20:33:44.0405574Z scale_ub=1200.0, 2025-05-07T20:33:44.0405957Z contiguous=True, 2025-05-07T20:33:44.0406334Z compiled=False, 2025-05-07T20:33:44.0406688Z ) 2025-05-07T20:33:44.0407209Z self = 2025-05-07T20:33:44.0408030Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:44.0408481Z 2025-05-07T20:33:44.0408620Z @given( 2025-05-07T20:33:44.0408993Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0409516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0410014Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0410533Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0411067Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0411517Z ) 2025-05-07T20:33:44.0412073Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0412777Z def test_silu_mul_quant( 2025-05-07T20:33:44.0413166Z self, 2025-05-07T20:33:44.0413479Z T: int, 2025-05-07T20:33:44.0413788Z D: int, 2025-05-07T20:33:44.0414133Z scale_ub: Optional[float], 2025-05-07T20:33:44.0414566Z contiguous: bool, 2025-05-07T20:33:44.0414950Z compiled: bool, 2025-05-07T20:33:44.0415314Z ) -> None: 2025-05-07T20:33:44.0415679Z torch.manual_seed(2025) 2025-05-07T20:33:44.0416079Z 2025-05-07T20:33:44.0416532Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0417097Z 2025-05-07T20:33:44.0417417Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0417905Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0418422Z x = x_sign * x_clamp 2025-05-07T20:33:44.0418816Z x0 = x[:, :D] 
2025-05-07T20:33:44.0419185Z x1 = x[:, D:] 2025-05-07T20:33:44.0419536Z 2025-05-07T20:33:44.0419841Z if contiguous: 2025-05-07T20:33:44.0420296Z x0 = x0.contiguous() 2025-05-07T20:33:44.0420793Z x1 = x1.contiguous() 2025-05-07T20:33:44.0421222Z 2025-05-07T20:33:44.0421546Z if scale_ub is not None: 2025-05-07T20:33:44.0422008Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0422559Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0423054Z ) 2025-05-07T20:33:44.0423373Z else: 2025-05-07T20:33:44.0423715Z scale_ub_tensor = None 2025-05-07T20:33:44.0424439Z 2025-05-07T20:33:44.0424812Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0425319Z op = silu_mul_quant 2025-05-07T20:33:44.0425720Z if compiled: 2025-05-07T20:33:44.0426122Z op = torch.compile(op) 2025-05-07T20:33:44.0426608Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0427054Z 2025-05-07T20:33:44.0427376Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.0427648Z 2025-05-07T20:33:44.0427815Z moe/activation_test.py:117: 2025-05-07T20:33:44.0428287Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0428812Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.0429260Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0430542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.0431650Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.0432443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0433482Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0434653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0435511Z kernel = self.compile( 2025-05-07T20:33:44.0436400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0437445Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0438078Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0438467Z 2025-05-07T20:33:44.0438795Z self = 2025-05-07T20:33:44.0440498Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0442722Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae2af71990>} 2025-05-07T20:33:44.0444877Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0446452Z context = 2025-05-07T20:33:44.0446881Z 2025-05-07T20:33:44.0447129Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0447952Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0448714Z module_map=module_map) 2025-05-07T20:33:44.0449285Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0449849Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.0450272Z E ^ 2025-05-07T20:33:44.0451009Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0451731Z 2025-05-07T20:33:44.0452400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0453414Z 2025-05-07T20:33:44.0453590Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0454261Z self=, 2025-05-07T20:33:44.0454904Z T=2048, 2025-05-07T20:33:44.0455223Z D=5120, 2025-05-07T20:33:44.0455491Z scale_ub=1200.0, 2025-05-07T20:33:44.0455763Z contiguous=True, 2025-05-07T20:33:44.0456034Z compiled=True, 2025-05-07T20:33:44.0456284Z ) 2025-05-07T20:33:44.0456657Z self = 2025-05-07T20:33:44.0457233Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:44.0457547Z 2025-05-07T20:33:44.0457649Z @given( 2025-05-07T20:33:44.0457926Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0458289Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0458656Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0459049Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0459432Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0459772Z ) 2025-05-07T20:33:44.0460186Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0460824Z def test_silu_mul_quant( 2025-05-07T20:33:44.0461191Z self, 2025-05-07T20:33:44.0461435Z T: int, 2025-05-07T20:33:44.0461670Z D: int, 2025-05-07T20:33:44.0461940Z scale_ub: Optional[float], 2025-05-07T20:33:44.0462268Z contiguous: bool, 2025-05-07T20:33:44.0462556Z compiled: bool, 2025-05-07T20:33:44.0462832Z ) -> None: 2025-05-07T20:33:44.0463092Z torch.manual_seed(2025) 2025-05-07T20:33:44.0463383Z 2025-05-07T20:33:44.0463698Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0464105Z 2025-05-07T20:33:44.0464348Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0464693Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0465078Z x = x_sign * x_clamp 2025-05-07T20:33:44.0465371Z x0 = x[:, :D] 2025-05-07T20:33:44.0465628Z x1 = x[:, D:] 2025-05-07T20:33:44.0465882Z 2025-05-07T20:33:44.0466120Z if contiguous: 2025-05-07T20:33:44.0466394Z x0 = x0.contiguous() 2025-05-07T20:33:44.0466710Z x1 = x1.contiguous() 2025-05-07T20:33:44.0467004Z 2025-05-07T20:33:44.0467237Z if scale_ub is not None: 2025-05-07T20:33:44.0467566Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0467965Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0468328Z ) 2025-05-07T20:33:44.0468568Z else: 2025-05-07T20:33:44.0468827Z scale_ub_tensor = None 2025-05-07T20:33:44.0469127Z 2025-05-07T20:33:44.0469410Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0469797Z op = silu_mul_quant 2025-05-07T20:33:44.0470097Z if compiled: 2025-05-07T20:33:44.0470399Z op = torch.compile(op) 2025-05-07T20:33:44.0470782Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0471145Z 2025-05-07T20:33:44.0471380Z y_fp8, y_scale = fn() 2025-05-07T20:33:44.0471729Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:44.0472078Z 2025-05-07T20:33:44.0472358Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0472756Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:44.0473107Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:44.0473475Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:44.0473977Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0474349Z 2025-05-07T20:33:44.0474595Z > 
y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fae29a196c0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <triton._C.libtriton.ir.context object at 0x...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fae29a18940>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <triton._C.libtriton.ir.context object at 0x...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
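All of these failures are the same event: Triton refuses to lower the fp8e4nv (float8 E4M3) type on this GPU, so both _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row die in make_ir before any kernel runs. The linux.g5.4xlarge.nvidia.gpu runner carries an A10G, which reports compute capability (8, 6); Triton's fp8e4nv lowering appears to require sm_89 (Ada) or newer, which is why the only fp8 dtypes on offer here are 'fp8e4b15' and 'fp8e5'. Below is a minimal sketch of a guard a test like this could use; the (8, 9) cutoff is our inference from the ValueError, not something the log states, and the names are hypothetical:

import unittest

import torch


def _supports_fp8e4nv() -> bool:
    # Assumption: fp8e4nv (E4M3) lowering needs sm_89 or newer; the A10G on
    # this runner reports (8, 6) and trips the ValueError seen above.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on the test method from the log, turning the hard
# CompilationError into a skip on pre-Ada GPUs:
#
#     @unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv requires sm_89+")
#     def test_silu_mul_quant(self, ...) -> None: ...

if __name__ == "__main__":
    print("fp8e4nv supported:", _supports_fp8e4nv())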
y_scale_ref = ref_fn() 2025-05-07T20:33:44.0566954Z 2025-05-07T20:33:44.0567080Z moe/activation_test.py:126: 2025-05-07T20:33:44.0567423Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0567816Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:44.0568199Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0569092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:44.0569947Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:44.0570579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0571363Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0572143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:44.0572974Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0573838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:44.0574695Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0575521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:44.0576259Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:44.0577054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:44.0577648Z fn() 2025-05-07T20:33:44.0578235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:44.0578910Z self.fn.run( 2025-05-07T20:33:44.0579455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0580061Z kernel = self.compile( 2025-05-07T20:33:44.0580685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0581438Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0581897Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0582166Z 2025-05-07T20:33:44.0582407Z self = 2025-05-07T20:33:44.0583639Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0585235Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fae297b0790>} 2025-05-07T20:33:44.0586788Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0587948Z context = 2025-05-07T20:33:44.0588278Z 2025-05-07T20:33:44.0588470Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0589068Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0589615Z module_map=module_map) 2025-05-07T20:33:44.0590030Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0590450Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:44.0590767Z E ^ 2025-05-07T20:33:44.0591306Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0591876Z 2025-05-07T20:33:44.0592376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0592968Z 2025-05-07T20:33:44.0593093Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0593635Z self=, 2025-05-07T20:33:44.0594100Z T=4096, 2025-05-07T20:33:44.0594322Z D=5120, 2025-05-07T20:33:44.0594554Z scale_ub=None, 2025-05-07T20:33:44.0594815Z contiguous=False, 2025-05-07T20:33:44.0595080Z compiled=False, 2025-05-07T20:33:44.0595328Z ) 2025-05-07T20:33:44.0595705Z self = 2025-05-07T20:33:44.0596273Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:44.0596601Z 2025-05-07T20:33:44.0596696Z @given( 2025-05-07T20:33:44.0596971Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0597334Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0597693Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0598081Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0598470Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0598801Z ) 2025-05-07T20:33:44.0599211Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0599729Z def test_silu_mul_quant( 2025-05-07T20:33:44.0600071Z self, 2025-05-07T20:33:44.0600304Z T: int, 2025-05-07T20:33:44.0600539Z D: int, 2025-05-07T20:33:44.0600840Z scale_ub: Optional[float], 2025-05-07T20:33:44.0601163Z contiguous: bool, 2025-05-07T20:33:44.0601450Z compiled: bool, 2025-05-07T20:33:44.0601712Z ) -> None: 2025-05-07T20:33:44.0601975Z torch.manual_seed(2025) 2025-05-07T20:33:44.0602261Z 2025-05-07T20:33:44.0602579Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0602976Z 2025-05-07T20:33:44.0603207Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0603543Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0603908Z x = x_sign * x_clamp 2025-05-07T20:33:44.0604197Z x0 = x[:, :D] 2025-05-07T20:33:44.0604458Z x1 = x[:, D:] 2025-05-07T20:33:44.0604700Z 2025-05-07T20:33:44.0604927Z if contiguous: 2025-05-07T20:33:44.0605206Z x0 = x0.contiguous() 2025-05-07T20:33:44.0605506Z x1 = x1.contiguous() 2025-05-07T20:33:44.0605791Z 2025-05-07T20:33:44.0606031Z if scale_ub is not None: 2025-05-07T20:33:44.0606353Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0606749Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0607166Z ) 2025-05-07T20:33:44.0607395Z else: 2025-05-07T20:33:44.0607691Z scale_ub_tensor = None 2025-05-07T20:33:44.0607995Z 2025-05-07T20:33:44.0608264Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0608630Z op = silu_mul_quant 2025-05-07T20:33:44.0608925Z if compiled: 
2025-05-07T20:33:44.0609212Z op = torch.compile(op) 2025-05-07T20:33:44.0609561Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0609887Z 2025-05-07T20:33:44.0610113Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.0610313Z 2025-05-07T20:33:44.0610431Z moe/activation_test.py:117: 2025-05-07T20:33:44.0610782Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0611178Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.0611506Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0612301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.0613102Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.0613715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0614498Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0615259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0615873Z kernel = self.compile( 2025-05-07T20:33:44.0616493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0617251Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0617712Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0617975Z 2025-05-07T20:33:44.0618218Z self = 2025-05-07T20:33:44.0619444Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0620998Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae297b1510>} 2025-05-07T20:33:44.0622524Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0624015Z context = 2025-05-07T20:33:44.0624432Z 2025-05-07T20:33:44.0624636Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0625235Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0625790Z module_map=module_map) 2025-05-07T20:33:44.0626216Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0626626Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.0626933Z E ^ 2025-05-07T20:33:44.0627475Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0627993Z 2025-05-07T20:33:44.0628474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0629061Z 2025-05-07T20:33:44.0629186Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0629671Z self=, 2025-05-07T20:33:44.0630137Z T=4096, 2025-05-07T20:33:44.0630357Z D=7168, 2025-05-07T20:33:44.0630588Z scale_ub=None, 2025-05-07T20:33:44.0630952Z contiguous=False, 2025-05-07T20:33:44.0631276Z compiled=False, 2025-05-07T20:33:44.0631522Z ) 2025-05-07T20:33:44.0631896Z self = 2025-05-07T20:33:44.0632463Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:44.0632782Z 2025-05-07T20:33:44.0632874Z @given( 2025-05-07T20:33:44.0633142Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0633566Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0633930Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0634316Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0634712Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0635047Z ) 2025-05-07T20:33:44.0635454Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0635967Z def test_silu_mul_quant( 2025-05-07T20:33:44.0636253Z self, 2025-05-07T20:33:44.0636484Z T: int, 2025-05-07T20:33:44.0636724Z D: int, 2025-05-07T20:33:44.0636975Z scale_ub: Optional[float], 2025-05-07T20:33:44.0637296Z contiguous: bool, 2025-05-07T20:33:44.0637582Z compiled: bool, 2025-05-07T20:33:44.0637844Z ) -> None: 2025-05-07T20:33:44.0638093Z torch.manual_seed(2025) 2025-05-07T20:33:44.0638375Z 2025-05-07T20:33:44.0638691Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0639087Z 2025-05-07T20:33:44.0639320Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0639661Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0640021Z x = x_sign * x_clamp 2025-05-07T20:33:44.0640311Z x0 = x[:, :D] 2025-05-07T20:33:44.0640566Z x1 = x[:, D:] 2025-05-07T20:33:44.0640807Z 2025-05-07T20:33:44.0641029Z if contiguous: 2025-05-07T20:33:44.0641302Z x0 = x0.contiguous() 2025-05-07T20:33:44.0641605Z x1 = x1.contiguous() 2025-05-07T20:33:44.0641892Z 2025-05-07T20:33:44.0642127Z if scale_ub is not None: 2025-05-07T20:33:44.0642447Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0642841Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0643203Z ) 2025-05-07T20:33:44.0643430Z else: 2025-05-07T20:33:44.0643672Z scale_ub_tensor = None 2025-05-07T20:33:44.0643964Z 2025-05-07T20:33:44.0644239Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0644599Z op = silu_mul_quant 2025-05-07T20:33:44.0644893Z if compiled: 2025-05-07T20:33:44.0645305Z op = torch.compile(op) 2025-05-07T20:33:44.0645713Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0646037Z 2025-05-07T20:33:44.0646271Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.0646464Z 2025-05-07T20:33:44.0646580Z moe/activation_test.py:117: 2025-05-07T20:33:44.0646931Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0647322Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.0647649Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0648444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.0649231Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.0649848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0650625Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0651394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0652005Z kernel = self.compile( 2025-05-07T20:33:44.0652630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0653505Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0653964Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0654229Z 2025-05-07T20:33:44.0654472Z self = 2025-05-07T20:33:44.0655691Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0657259Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae297b1bd0>} 2025-05-07T20:33:44.0658787Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0659962Z context = 2025-05-07T20:33:44.0660293Z 2025-05-07T20:33:44.0660491Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0661083Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0661626Z module_map=module_map) 2025-05-07T20:33:44.0662052Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0662454Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.0662762Z E ^ 2025-05-07T20:33:44.0663303Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0663816Z 2025-05-07T20:33:44.0664294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0664880Z 2025-05-07T20:33:44.0665007Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0665489Z self=, 2025-05-07T20:33:44.0665950Z T=128, 2025-05-07T20:33:44.0666169Z D=7168, 2025-05-07T20:33:44.0666399Z scale_ub=None, 2025-05-07T20:33:44.0666653Z contiguous=False, 2025-05-07T20:33:44.0666917Z compiled=True, 2025-05-07T20:33:44.0667151Z ) 2025-05-07T20:33:44.0667520Z self = 2025-05-07T20:33:44.0668085Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:44.0668451Z 2025-05-07T20:33:44.0668545Z @given( 2025-05-07T20:33:44.0668862Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0669229Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0669583Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0669972Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0670363Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0670704Z ) 2025-05-07T20:33:44.0671103Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0671613Z def test_silu_mul_quant( 2025-05-07T20:33:44.0671901Z self, 2025-05-07T20:33:44.0672131Z T: int, 2025-05-07T20:33:44.0672367Z D: int, 2025-05-07T20:33:44.0672625Z scale_ub: Optional[float], 2025-05-07T20:33:44.0672946Z contiguous: bool, 2025-05-07T20:33:44.0673225Z compiled: bool, 2025-05-07T20:33:44.0673488Z ) -> None: 2025-05-07T20:33:44.0673794Z torch.manual_seed(2025) 2025-05-07T20:33:44.0674075Z 2025-05-07T20:33:44.0674400Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0674790Z 2025-05-07T20:33:44.0675023Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0675361Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0675780Z x = x_sign * x_clamp 2025-05-07T20:33:44.0676109Z x0 = x[:, :D] 2025-05-07T20:33:44.0676366Z x1 = x[:, D:] 2025-05-07T20:33:44.0676607Z 2025-05-07T20:33:44.0676829Z if contiguous: 2025-05-07T20:33:44.0677101Z x0 = x0.contiguous() 2025-05-07T20:33:44.0677402Z x1 = x1.contiguous() 2025-05-07T20:33:44.0677680Z 2025-05-07T20:33:44.0677907Z if scale_ub is not None: 2025-05-07T20:33:44.0678231Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0678615Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0678978Z ) 2025-05-07T20:33:44.0679207Z else: 2025-05-07T20:33:44.0679451Z scale_ub_tensor = None 2025-05-07T20:33:44.0679745Z 2025-05-07T20:33:44.0680015Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0680375Z op = silu_mul_quant 2025-05-07T20:33:44.0680668Z if compiled: 2025-05-07T20:33:44.0680963Z op = torch.compile(op) 2025-05-07T20:33:44.0681304Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0681627Z 2025-05-07T20:33:44.0681858Z y_fp8, y_scale = fn() 2025-05-07T20:33:44.0682187Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:44.0682530Z 2025-05-07T20:33:44.0682810Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0683197Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:44.0683538Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:44.0683906Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:44.0684330Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0684693Z 2025-05-07T20:33:44.0684939Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:44.0685167Z 2025-05-07T20:33:44.0685292Z moe/activation_test.py:126: 2025-05-07T20:33:44.0685635Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0686037Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:44.0686419Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0687314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:44.0687435Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:44.0687846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0688109Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0688626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:44.0688928Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0689385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:44.0689681Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0690114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:44.0690308Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:44.0690698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:44.0690797Z fn() 2025-05-07T20:33:44.0691307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:44.0691413Z self.fn.run( 2025-05-07T20:33:44.0691797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0691907Z kernel = self.compile( 2025-05-07T20:33:44.0692433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0692637Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0692788Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0692799Z 2025-05-07T20:33:44.0693037Z self = 2025-05-07T20:33:44.0693921Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0694511Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fae29a1a9e0>} 2025-05-07T20:33:44.0695356Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0695587Z context = 2025-05-07T20:33:44.0695592Z 2025-05-07T20:33:44.0695784Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0696090Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0696220Z module_map=module_map) 2025-05-07T20:33:44.0696408Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0696540Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:44.0696636Z E ^ 2025-05-07T20:33:44.0697042Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0697048Z 2025-05-07T20:33:44.0697529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0697535Z 2025-05-07T20:33:44.0697657Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0697910Z self=, 2025-05-07T20:33:44.0698007Z T=128, 2025-05-07T20:33:44.0698096Z D=7168, 2025-05-07T20:33:44.0698197Z scale_ub=None, 2025-05-07T20:33:44.0698303Z contiguous=False, 2025-05-07T20:33:44.0698402Z compiled=False, 2025-05-07T20:33:44.0698494Z ) 2025-05-07T20:33:44.0698745Z self = 2025-05-07T20:33:44.0698993Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:44.0699042Z 2025-05-07T20:33:44.0699141Z @given( 2025-05-07T20:33:44.0699280Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0699398Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0699542Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0699684Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0699825Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0699913Z ) 2025-05-07T20:33:44.0700197Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0700312Z def test_silu_mul_quant( 2025-05-07T20:33:44.0700404Z self, 2025-05-07T20:33:44.0700495Z T: int, 2025-05-07T20:33:44.0700592Z D: int, 2025-05-07T20:33:44.0700708Z scale_ub: Optional[float], 2025-05-07T20:33:44.0700813Z contiguous: bool, 2025-05-07T20:33:44.0700930Z compiled: bool, 2025-05-07T20:33:44.0701040Z ) -> None: 2025-05-07T20:33:44.0701173Z torch.manual_seed(2025) 2025-05-07T20:33:44.0701269Z 2025-05-07T20:33:44.0701464Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0701605Z 2025-05-07T20:33:44.0701713Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0701925Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0702036Z x = x_sign * x_clamp 2025-05-07T20:33:44.0702132Z x0 = x[:, :D] 2025-05-07T20:33:44.0702226Z x1 = x[:, D:] 2025-05-07T20:33:44.0702315Z 2025-05-07T20:33:44.0702415Z if contiguous: 2025-05-07T20:33:44.0702522Z x0 = x0.contiguous() 2025-05-07T20:33:44.0702630Z x1 = x1.contiguous() 2025-05-07T20:33:44.0702714Z 2025-05-07T20:33:44.0702823Z if scale_ub is not None: 2025-05-07T20:33:44.0702954Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0703113Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0703216Z ) 2025-05-07T20:33:44.0703306Z else: 2025-05-07T20:33:44.0703416Z scale_ub_tensor = None 2025-05-07T20:33:44.0703511Z 2025-05-07T20:33:44.0703660Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0703771Z op = silu_mul_quant 2025-05-07T20:33:44.0703878Z if compiled: 
2025-05-07T20:33:44.0703996Z op = torch.compile(op) 2025-05-07T20:33:44.0704122Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0704215Z 2025-05-07T20:33:44.0704322Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.0704327Z 2025-05-07T20:33:44.0704443Z moe/activation_test.py:117: 2025-05-07T20:33:44.0704599Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0704720Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.0704844Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0705420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.0705536Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.0705952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0706215Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0706605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0706721Z kernel = self.compile( 2025-05-07T20:33:44.0707160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0707368Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0707517Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0707573Z 2025-05-07T20:33:44.0707853Z self = 2025-05-07T20:33:44.0708741Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0709320Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae2982e560>} 2025-05-07T20:33:44.0710170Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0710392Z context = 2025-05-07T20:33:44.0710397Z 2025-05-07T20:33:44.0710597Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0710903Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0711030Z module_map=module_map) 2025-05-07T20:33:44.0711223Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0711427Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.0711522Z E ^ 2025-05-07T20:33:44.0711934Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0711939Z 2025-05-07T20:33:44.0712409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0712415Z 2025-05-07T20:33:44.0712541Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0713411Z self=, 2025-05-07T20:33:44.0713506Z T=4096, 2025-05-07T20:33:44.0713653Z D=5120, 2025-05-07T20:33:44.0713755Z scale_ub=1200.0, 2025-05-07T20:33:44.0713856Z contiguous=True, 2025-05-07T20:33:44.0713959Z compiled=False, 2025-05-07T20:33:44.0714045Z ) 2025-05-07T20:33:44.0714301Z self = 2025-05-07T20:33:44.0714510Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:44.0714515Z 2025-05-07T20:33:44.0714607Z @given( 2025-05-07T20:33:44.0714750Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0714867Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0715002Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0715143Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0715276Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0715364Z ) 2025-05-07T20:33:44.0715653Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0715770Z def test_silu_mul_quant( 2025-05-07T20:33:44.0715867Z self, 2025-05-07T20:33:44.0715956Z T: int, 2025-05-07T20:33:44.0716047Z D: int, 2025-05-07T20:33:44.0716167Z scale_ub: Optional[float], 2025-05-07T20:33:44.0716275Z contiguous: bool, 2025-05-07T20:33:44.0716379Z compiled: bool, 2025-05-07T20:33:44.0716479Z ) -> None: 2025-05-07T20:33:44.0716591Z torch.manual_seed(2025) 2025-05-07T20:33:44.0716678Z 2025-05-07T20:33:44.0716879Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0716967Z 2025-05-07T20:33:44.0717075Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0717225Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0717331Z x = x_sign * x_clamp 2025-05-07T20:33:44.0717433Z x0 = x[:, :D] 2025-05-07T20:33:44.0717528Z x1 = x[:, D:] 2025-05-07T20:33:44.0717614Z 2025-05-07T20:33:44.0717784Z if contiguous: 2025-05-07T20:33:44.0717934Z x0 = x0.contiguous() 2025-05-07T20:33:44.0718042Z x1 = x1.contiguous() 2025-05-07T20:33:44.0718138Z 2025-05-07T20:33:44.0718246Z if scale_ub is not None: 2025-05-07T20:33:44.0718370Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0718540Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0718630Z ) 2025-05-07T20:33:44.0718724Z else: 2025-05-07T20:33:44.0718839Z scale_ub_tensor = None 2025-05-07T20:33:44.0718925Z 2025-05-07T20:33:44.0719079Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0719186Z op = silu_mul_quant 2025-05-07T20:33:44.0719288Z if compiled: 2025-05-07T20:33:44.0719410Z op = torch.compile(op) 2025-05-07T20:33:44.0719535Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0719622Z 2025-05-07T20:33:44.0719738Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.0719743Z 2025-05-07T20:33:44.0719858Z moe/activation_test.py:117: 2025-05-07T20:33:44.0720008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0720133Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.0720251Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0720916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.0721040Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.0734988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0735280Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0735685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0735804Z kernel = self.compile( 2025-05-07T20:33:44.0736261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0736467Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0736617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0736627Z 2025-05-07T20:33:44.0736877Z self = 2025-05-07T20:33:44.0737759Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0738349Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae2982e7a0>} 2025-05-07T20:33:44.0739191Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0739419Z context = 2025-05-07T20:33:44.0739424Z 2025-05-07T20:33:44.0739622Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0739925Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0740057Z module_map=module_map) 2025-05-07T20:33:44.0740243Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0740361Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.0740457Z E ^ 2025-05-07T20:33:44.0740870Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0740876Z 2025-05-07T20:33:44.0741538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0741544Z 2025-05-07T20:33:44.0741671Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0741926Z self=, 2025-05-07T20:33:44.0742028Z T=1, 2025-05-07T20:33:44.0742121Z D=5120, 2025-05-07T20:33:44.0742225Z scale_ub=None, 2025-05-07T20:33:44.0742334Z contiguous=True, 2025-05-07T20:33:44.0742433Z compiled=True, 2025-05-07T20:33:44.0742522Z ) 2025-05-07T20:33:44.0742779Z self = 2025-05-07T20:33:44.0742965Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:44.0742970Z 2025-05-07T20:33:44.0743070Z @given( 2025-05-07T20:33:44.0743211Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0743329Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0743474Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0743613Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0743746Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0743843Z ) 2025-05-07T20:33:44.0744126Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0744371Z def test_silu_mul_quant( 2025-05-07T20:33:44.0744470Z self, 2025-05-07T20:33:44.0744562Z T: int, 2025-05-07T20:33:44.0744658Z D: int, 2025-05-07T20:33:44.0744775Z scale_ub: Optional[float], 2025-05-07T20:33:44.0744882Z contiguous: bool, 2025-05-07T20:33:44.0744989Z compiled: bool, 2025-05-07T20:33:44.0745086Z ) -> None: 2025-05-07T20:33:44.0745198Z torch.manual_seed(2025) 2025-05-07T20:33:44.0745292Z 2025-05-07T20:33:44.0745490Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0745579Z 2025-05-07T20:33:44.0745697Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0745846Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0745952Z x = x_sign * x_clamp 2025-05-07T20:33:44.0746055Z x0 = x[:, :D] 2025-05-07T20:33:44.0746153Z x1 = x[:, D:] 2025-05-07T20:33:44.0746240Z 2025-05-07T20:33:44.0746348Z if contiguous: 2025-05-07T20:33:44.0746459Z x0 = x0.contiguous() 2025-05-07T20:33:44.0746570Z x1 = x1.contiguous() 2025-05-07T20:33:44.0746658Z 2025-05-07T20:33:44.0746767Z if scale_ub is not None: 2025-05-07T20:33:44.0746895Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0747053Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0747142Z ) 2025-05-07T20:33:44.0747238Z else: 2025-05-07T20:33:44.0747348Z scale_ub_tensor = None 2025-05-07T20:33:44.0747435Z 2025-05-07T20:33:44.0747595Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0747705Z op = silu_mul_quant 2025-05-07T20:33:44.0747809Z if compiled: 2025-05-07T20:33:44.0747931Z op = torch.compile(op) 2025-05-07T20:33:44.0748055Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0748146Z 2025-05-07T20:33:44.0748256Z y_fp8, y_scale = fn() 2025-05-07T20:33:44.0748401Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:44.0748493Z 2025-05-07T20:33:44.0748652Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0748771Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:44.0748897Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:44.0749039Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:44.0749202Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0749292Z 2025-05-07T20:33:44.0749411Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:44.0749469Z 2025-05-07T20:33:44.0749593Z moe/activation_test.py:126: 2025-05-07T20:33:44.0749787Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0749915Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:44.0750080Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0750719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:44.0750838Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:44.0751254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0751512Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0751930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:44.0752226Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0752685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:44.0752978Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0753495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:44.0753782Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:44.0754177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:44.0754269Z fn() 2025-05-07T20:33:44.0754730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:44.0754834Z self.fn.run( 2025-05-07T20:33:44.0755219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0755337Z kernel = self.compile( 2025-05-07T20:33:44.0755776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0755979Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0756137Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0756143Z 2025-05-07T20:33:44.0756378Z self = 2025-05-07T20:33:44.0757260Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0757841Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fae2982e050>} 2025-05-07T20:33:44.0758684Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0758914Z context = 2025-05-07T20:33:44.0758922Z 2025-05-07T20:33:44.0759114Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0759421Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0759547Z module_map=module_map) 2025-05-07T20:33:44.0759736Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0759861Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:44.0759952Z E ^ 2025-05-07T20:33:44.0760355Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0760414Z 2025-05-07T20:33:44.0760931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0760937Z 2025-05-07T20:33:44.0761062Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0761326Z self=, 2025-05-07T20:33:44.0761418Z T=2048, 2025-05-07T20:33:44.0761510Z D=5120, 2025-05-07T20:33:44.0761611Z scale_ub=None, 2025-05-07T20:33:44.0761713Z contiguous=True, 2025-05-07T20:33:44.0761813Z compiled=True, 2025-05-07T20:33:44.0761905Z ) 2025-05-07T20:33:44.0762154Z self = 2025-05-07T20:33:44.0762352Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:44.0762357Z 2025-05-07T20:33:44.0762451Z @given( 2025-05-07T20:33:44.0762590Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0762719Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0762856Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0762993Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0763130Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0763271Z ) 2025-05-07T20:33:44.0763597Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0763716Z def test_silu_mul_quant( 2025-05-07T20:33:44.0763807Z self, 2025-05-07T20:33:44.0763898Z T: int, 2025-05-07T20:33:44.0763992Z D: int, 2025-05-07T20:33:44.0764108Z scale_ub: Optional[float], 2025-05-07T20:33:44.0764213Z contiguous: bool, 2025-05-07T20:33:44.0764318Z compiled: bool, 2025-05-07T20:33:44.0764410Z ) -> None: 2025-05-07T20:33:44.0764526Z torch.manual_seed(2025) 2025-05-07T20:33:44.0764612Z 2025-05-07T20:33:44.0764811Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0764907Z 2025-05-07T20:33:44.0765016Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0765161Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0765271Z x = x_sign * x_clamp 2025-05-07T20:33:44.0765371Z x0 = x[:, :D] 2025-05-07T20:33:44.0765466Z x1 = x[:, D:] 2025-05-07T20:33:44.0765560Z 2025-05-07T20:33:44.0765660Z if contiguous: 2025-05-07T20:33:44.0765769Z x0 = x0.contiguous() 2025-05-07T20:33:44.0765879Z x1 = x1.contiguous() 2025-05-07T20:33:44.0765965Z 2025-05-07T20:33:44.0766078Z if scale_ub is not None: 2025-05-07T20:33:44.0766202Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0766360Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0766454Z ) 2025-05-07T20:33:44.0766543Z else: 2025-05-07T20:33:44.0766653Z scale_ub_tensor = None 2025-05-07T20:33:44.0766747Z 2025-05-07T20:33:44.0766900Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0767008Z op = silu_mul_quant 2025-05-07T20:33:44.0767113Z if compiled: 
2025-05-07T20:33:44.0767232Z op = torch.compile(op) 2025-05-07T20:33:44.0767360Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0767451Z 2025-05-07T20:33:44.0767560Z y_fp8, y_scale = fn() 2025-05-07T20:33:44.0767705Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:44.0767793Z 2025-05-07T20:33:44.0767951Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0768074Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:44.0768192Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:44.0768333Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:44.0768501Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0768678Z 2025-05-07T20:33:44.0768795Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:44.0768841Z 2025-05-07T20:33:44.0768964Z moe/activation_test.py:126: 2025-05-07T20:33:44.0769113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0769238Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:44.0769402Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0770037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:44.0770161Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:44.0770571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0770826Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0771247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:44.0771547Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0772006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:44.0772382Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0772809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:44.0773006Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:44.0773397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:44.0773494Z fn() 2025-05-07T20:33:44.0773954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:44.0774055Z self.fn.run( 2025-05-07T20:33:44.0774447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0774557Z kernel = self.compile( 2025-05-07T20:33:44.0774990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0775204Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0775353Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0775358Z 2025-05-07T20:33:44.0775601Z self = 2025-05-07T20:33:44.0776482Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:33:44.0777062Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae292e77f0>} 2025-05-07T20:33:44.0777906Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0778132Z context = 2025-05-07T20:33:44.0778138Z 2025-05-07T20:33:44.0778333Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0778634Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0778764Z module_map=module_map) 2025-05-07T20:33:44.0778953Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0779072Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:44.0779217Z E ^ 2025-05-07T20:33:44.0779666Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0779672Z 2025-05-07T20:33:44.0780143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0780150Z 2025-05-07T20:33:44.0780279Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0780535Z self=, 2025-05-07T20:33:44.0780630Z T=128, 2025-05-07T20:33:44.0780723Z D=5120, 2025-05-07T20:33:44.0780820Z scale_ub=None, 2025-05-07T20:33:44.0780927Z contiguous=True, 2025-05-07T20:33:44.0781026Z compiled=True, 2025-05-07T20:33:44.0781133Z ) 2025-05-07T20:33:44.0781418Z self = 2025-05-07T20:33:44.0781612Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:44.0781620Z 2025-05-07T20:33:44.0781711Z @given( 2025-05-07T20:33:44.0781854Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0781971Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0782112Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0782295Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0782469Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0782563Z ) 2025-05-07T20:33:44.0782848Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0782958Z def test_silu_mul_quant( 2025-05-07T20:33:44.0783054Z self, 2025-05-07T20:33:44.0783145Z T: int, 2025-05-07T20:33:44.0783234Z D: int, 2025-05-07T20:33:44.0783355Z scale_ub: Optional[float], 2025-05-07T20:33:44.0783459Z contiguous: bool, 2025-05-07T20:33:44.0783561Z compiled: bool, 2025-05-07T20:33:44.0783655Z ) -> None: 2025-05-07T20:33:44.0783771Z torch.manual_seed(2025) 2025-05-07T20:33:44.0783862Z 2025-05-07T20:33:44.0784060Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0784149Z 2025-05-07T20:33:44.0784261Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0784405Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0784518Z x = x_sign * x_clamp 2025-05-07T20:33:44.0784616Z x0 = x[:, :D] 2025-05-07T20:33:44.0784711Z x1 = x[:, D:] 2025-05-07T20:33:44.0784799Z 2025-05-07T20:33:44.0784903Z if contiguous: 2025-05-07T20:33:44.0785011Z x0 = x0.contiguous() 2025-05-07T20:33:44.0785118Z x1 = x1.contiguous() 2025-05-07T20:33:44.0785206Z 2025-05-07T20:33:44.0785317Z if scale_ub is not None: 2025-05-07T20:33:44.0785444Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0785602Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0785696Z ) 2025-05-07T20:33:44.0785792Z else: 2025-05-07T20:33:44.0785905Z scale_ub_tensor = None 2025-05-07T20:33:44.0785991Z 2025-05-07T20:33:44.0786144Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:33:44.0786251Z op = silu_mul_quant 2025-05-07T20:33:44.0786354Z if compiled: 2025-05-07T20:33:44.0786476Z op = torch.compile(op) 2025-05-07T20:33:44.0786599Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0786686Z 2025-05-07T20:33:44.0786796Z y_fp8, y_scale = fn() 2025-05-07T20:33:44.0786938Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:44.0787028Z 2025-05-07T20:33:44.0787187Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0787309Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:44.0787429Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:44.0787569Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:44.0787788Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0787921Z 2025-05-07T20:33:44.0788041Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:44.0788046Z 2025-05-07T20:33:44.0788162Z moe/activation_test.py:126: 2025-05-07T20:33:44.0788317Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0788445Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:44.0788608Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0789244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:44.0789363Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:44.0789783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0790039Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0790470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:44.0790765Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0791327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:44.0791665Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0792093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:44.0792288Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:44.0792684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:44.0792777Z fn() 2025-05-07T20:33:44.0793250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:44.0793348Z self.fn.run( 2025-05-07T20:33:44.0793792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0793908Z kernel = self.compile( 2025-05-07T20:33:44.0794348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0794551Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0794701Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0794706Z 2025-05-07T20:33:44.0794943Z self = 2025-05-07T20:33:44.0795833Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0796413Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae28c24280>} 2025-05-07T20:33:44.0797261Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0797487Z context = 2025-05-07T20:33:44.0797493Z 2025-05-07T20:33:44.0797684Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0797990Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0798115Z module_map=module_map) 2025-05-07T20:33:44.0798305Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0798477Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:44.0798612Z E ^ 2025-05-07T20:33:44.0799021Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0799026Z 2025-05-07T20:33:44.0799498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0799503Z 2025-05-07T20:33:44.0799626Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0799886Z self=, 2025-05-07T20:33:44.0799978Z T=4096, 2025-05-07T20:33:44.0800073Z D=5120, 2025-05-07T20:33:44.0800169Z scale_ub=None, 2025-05-07T20:33:44.0800269Z contiguous=True, 2025-05-07T20:33:44.0800370Z compiled=True, 2025-05-07T20:33:44.0800458Z ) 2025-05-07T20:33:44.0800706Z self = 2025-05-07T20:33:44.0800912Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:44.0800917Z 2025-05-07T20:33:44.0801008Z @given( 2025-05-07T20:33:44.0801146Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0801267Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0801451Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0801657Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0801793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0801880Z ) 2025-05-07T20:33:44.0802167Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0802279Z def test_silu_mul_quant( 2025-05-07T20:33:44.0802370Z self, 2025-05-07T20:33:44.0802466Z T: int, 2025-05-07T20:33:44.0802558Z D: int, 2025-05-07T20:33:44.0802675Z scale_ub: Optional[float], 2025-05-07T20:33:44.0802784Z contiguous: bool, 2025-05-07T20:33:44.0802890Z compiled: bool, 2025-05-07T20:33:44.0802986Z ) -> None: 2025-05-07T20:33:44.0803102Z torch.manual_seed(2025) 2025-05-07T20:33:44.0803189Z 2025-05-07T20:33:44.0803387Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0803479Z 2025-05-07T20:33:44.0803586Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0803738Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0803844Z x = x_sign * x_clamp 2025-05-07T20:33:44.0803939Z x0 = x[:, :D] 2025-05-07T20:33:44.0804039Z x1 = x[:, D:] 2025-05-07T20:33:44.0804126Z 2025-05-07T20:33:44.0804225Z if contiguous: 2025-05-07T20:33:44.0804340Z x0 = x0.contiguous() 2025-05-07T20:33:44.0804445Z x1 = x1.contiguous() 2025-05-07T20:33:44.0804531Z 2025-05-07T20:33:44.0804641Z if scale_ub is not None: 2025-05-07T20:33:44.0804765Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0804930Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0805025Z ) 2025-05-07T20:33:44.0805116Z else: 2025-05-07T20:33:44.0805228Z scale_ub_tensor 
= None 2025-05-07T20:33:44.0805314Z 2025-05-07T20:33:44.0805465Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0805577Z op = silu_mul_quant 2025-05-07T20:33:44.0805679Z if compiled: 2025-05-07T20:33:44.0805800Z op = torch.compile(op) 2025-05-07T20:33:44.0805930Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0806018Z 2025-05-07T20:33:44.0806123Z y_fp8, y_scale = fn() 2025-05-07T20:33:44.0806268Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:44.0806360Z 2025-05-07T20:33:44.0806519Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0806645Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:44.0806816Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:44.0807006Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:44.0807172Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0807259Z 2025-05-07T20:33:44.0807381Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:44.0807388Z 2025-05-07T20:33:44.0807504Z moe/activation_test.py:126: 2025-05-07T20:33:44.0807658Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0807785Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:44.0807943Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0808589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:44.0808708Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:44.0809121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0809385Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0809804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:44.0810098Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0810644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:44.0810934Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0811369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:44.0811564Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:44.0811956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:44.0812053Z fn() 2025-05-07T20:33:44.0812514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:44.0812616Z self.fn.run( 2025-05-07T20:33:44.0813005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0813126Z kernel = self.compile( 2025-05-07T20:33:44.0813564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0813767Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0813914Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0813922Z 2025-05-07T20:33:44.0814158Z self = 2025-05-07T20:33:44.0815042Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0815622Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae28c252d0>} 2025-05-07T20:33:44.0816471Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0816697Z context = 2025-05-07T20:33:44.0816702Z 2025-05-07T20:33:44.0816894Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0817201Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0817388Z module_map=module_map) 2025-05-07T20:33:44.0817616Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0817736Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:44.0817834Z E ^ 2025-05-07T20:33:44.0818239Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0818249Z 2025-05-07T20:33:44.0818720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0818725Z 2025-05-07T20:33:44.0818849Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0819103Z self=, 2025-05-07T20:33:44.0819199Z T=16384, 2025-05-07T20:33:44.0819291Z D=5120, 2025-05-07T20:33:44.0819398Z scale_ub=None, 2025-05-07T20:33:44.0819498Z contiguous=True, 2025-05-07T20:33:44.0819596Z compiled=True, 2025-05-07T20:33:44.0819690Z ) 2025-05-07T20:33:44.0819943Z self = 2025-05-07T20:33:44.0820145Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:44.0820150Z 2025-05-07T20:33:44.0820249Z @given( 2025-05-07T20:33:44.0820436Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0820592Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0820732Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0820879Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0821038Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0821146Z ) 2025-05-07T20:33:44.0821427Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0821540Z def test_silu_mul_quant( 2025-05-07T20:33:44.0821632Z self, 2025-05-07T20:33:44.0821723Z T: int, 2025-05-07T20:33:44.0821822Z D: int, 2025-05-07T20:33:44.0821940Z scale_ub: Optional[float], 2025-05-07T20:33:44.0822047Z contiguous: bool, 2025-05-07T20:33:44.0822152Z compiled: bool, 2025-05-07T20:33:44.0822245Z ) -> None: 2025-05-07T20:33:44.0822358Z torch.manual_seed(2025) 2025-05-07T20:33:44.0822449Z 2025-05-07T20:33:44.0822652Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0822739Z 2025-05-07T20:33:44.0822854Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0822998Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0823105Z x = x_sign * x_clamp 2025-05-07T20:33:44.0823199Z x0 = x[:, :D] 2025-05-07T20:33:44.0823292Z x1 = x[:, D:] 2025-05-07T20:33:44.0823385Z 2025-05-07T20:33:44.0823484Z if contiguous: 2025-05-07T20:33:44.0823594Z x0 = x0.contiguous() 2025-05-07T20:33:44.0823706Z x1 = x1.contiguous() 2025-05-07T20:33:44.0824011Z 2025-05-07T20:33:44.0824177Z if scale_ub is not None: 2025-05-07T20:33:44.0824354Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0824515Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:33:44.0824607Z ) 2025-05-07T20:33:44.0824701Z else: 2025-05-07T20:33:44.0824812Z scale_ub_tensor = None 2025-05-07T20:33:44.0824906Z 2025-05-07T20:33:44.0825063Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0825171Z op = silu_mul_quant 2025-05-07T20:33:44.0825276Z if compiled: 2025-05-07T20:33:44.0825392Z op = torch.compile(op) 2025-05-07T20:33:44.0825516Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0825605Z 2025-05-07T20:33:44.0825713Z y_fp8, y_scale = fn() 2025-05-07T20:33:44.0825854Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:44.0825945Z 2025-05-07T20:33:44.0826101Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0826313Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:44.0826515Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:44.0826660Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:44.0826832Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0826920Z 2025-05-07T20:33:44.0827041Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:44.0827047Z 2025-05-07T20:33:44.0827166Z moe/activation_test.py:126: 2025-05-07T20:33:44.0827315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0827439Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:44.0827605Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0828238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:44.0828360Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:44.0828775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0829034Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0829458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:44.0829879Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0830342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:44.0830636Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0831065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:44.0831262Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:44.0831661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:44.0831754Z fn() 2025-05-07T20:33:44.0832218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:44.0832317Z self.fn.run( 2025-05-07T20:33:44.0832715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0832824Z kernel = self.compile( 2025-05-07T20:33:44.0833257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0833466Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0833736Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
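Every failing example above has the same root cause: Triton's fp8e4nv type is FP8 E4M3, which the NVIDIA backend only emits for compute capability 8.9 and newer (Ada/Hopper). This linux.g5.4xlarge runner carries an A10G at compute capability 8.6, so the kernel aborts at IR-generation time rather than producing wrong numbers. A minimal guard, sketched under those assumptions (the helper supports_fp8e4nv and the guarded test name are hypothetical, not part of moe/activation_test.py), would skip the test on unsupported GPUs:

    # Hypothetical guard, not part of the test file: skip FP8 E4M3 tests on
    # GPUs older than SM 8.9, where Triton rejects the fp8e4nv dtype.
    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv maps to CUDA FP8 E4M3; the NVIDIA backend only
        # code-generates it for compute capability >= (8, 9). The A10G on
        # this runner reports (8, 6), hence the CompilationError above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @pytest.mark.skipif(not supports_fp8e4nv(), reason="FP8 E4M3 needs SM 8.9+")
    def test_silu_mul_quant_guarded() -> None:
        ...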
2025-05-07T20:33:44.0838670Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -- same test body; this time the forward call itself fails:
2025-05-07T20:33:44.0845231Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:44.0845235Z 
2025-05-07T20:33:44.0845350Z moe/activation_test.py:117: 
2025-05-07T20:33:44.0845593Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:44.0845717Z moe/activation_test.py:115: in fn
2025-05-07T20:33:44.0845834Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:44.0846254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:44.0846370Z     return fn(*args, **kwargs)
2025-05-07T20:33:44.0846930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:44.0847049Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:44.0847456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:44.0847712Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:44.0848104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:44.0848217Z     kernel = self.compile(
2025-05-07T20:33:44.0848661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:44.0848862Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:44.0849097Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:44.0849103Z 
2025-05-07T20:33:44.0849345Z self = <...>
2025-05-07T20:33:44.0850227Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:44.0850830Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fae28222560>}
2025-05-07T20:33:44.0851702Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:33:44.0851925Z context = <...>
2025-05-07T20:33:44.0851939Z 
2025-05-07T20:33:44.0852132Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:44.0852435Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:44.0852563Z                            module_map=module_map)
2025-05-07T20:33:44.0852750Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:44.0852870Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:44.0852965Z E       ^
2025-05-07T20:33:44.0853367Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:44.0853376Z 
2025-05-07T20:33:44.0853851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
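Here the failure moves from the reference path's row-quantize kernel into silu_mul_quant's own fused kernel, _fbgemm_silu_mul_quant; both abort in make_ir for the same architectural reason. For orientation, the operation under test is small; a rough eager-mode sketch (an illustration assuming the usual rowwise recipe of scale = row_max / FP8_MAX clamped by scale_ub, not fbgemm_gpu's actual implementation) is:

    # Illustrative eager-mode equivalent of SiLU-mul + rowwise FP8 quantization.
    # silu_mul_quant_ref and the exact scaling recipe are assumptions made for
    # clarity here, not fbgemm_gpu code.
    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # y = SiLU(x0) * x1, computed in fp32 as in the test's ref_fn.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        # Per-row dequantization scale; the test dequantizes with
        # y_fp8.to(torch.float32) * y_scale[:, None].
        y_scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale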
2025-05-07T20:33:44.0853979Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True) -- same test body; ref_fn() fails identically in _kernel_quantize_fp8_row (CompilationError: fp8e4nv not supported).
2025-05-07T20:33:44.0873129Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False) -- same test body; fn() fails identically in _fbgemm_silu_mul_quant.
2025-05-07T20:33:44.0891870Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True) -- same test body; fn() fails identically in _fbgemm_silu_mul_quant.
2025-05-07T20:33:44.0907072Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -- same test body; fn() fails identically in _fbgemm_silu_mul_quant.
2025-05-07T20:33:44.0921797Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -- same test body; fn() fails identically in _fbgemm_silu_mul_quant.
2025-05-07T20:33:44.0936888Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -- same test body; fn() fails identically in _fbgemm_silu_mul_quant.
2025-05-07T20:33:44.0951534Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -- same test body; fn() fails identically in _fbgemm_silu_mul_quant.
2025-05-07T20:33:44.0967625Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -- same test body; fn() fails below:
2025-05-07T20:33:44.0974243Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:44.0974248Z 
2025-05-07T20:33:44.0974371Z moe/activation_test.py:117: 
2025-05-07T20:33:44.0974522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:44.0974639Z moe/activation_test.py:115: in fn
2025-05-07T20:33:44.0974758Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:44.0975222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:44.0975379Z     return fn(*args, **kwargs)
2025-05-07T20:33:44.0975942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.0976055Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.0976469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0976724Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0977113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0977233Z kernel = self.compile( 2025-05-07T20:33:44.0977670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0977874Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0978028Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0978033Z 2025-05-07T20:33:44.0978268Z self = 2025-05-07T20:33:44.0979146Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0979718Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fae289a85e0>} 2025-05-07T20:33:44.0980572Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.0980798Z context = 2025-05-07T20:33:44.0980803Z 2025-05-07T20:33:44.0980996Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.0981298Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.0981448Z module_map=module_map) 2025-05-07T20:33:44.0981645Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.0981787Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.0981890Z E ^ 2025-05-07T20:33:44.0982314Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.0982408Z 2025-05-07T20:33:44.0982881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.0982886Z 2025-05-07T20:33:44.0983013Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.0983276Z self=, 2025-05-07T20:33:44.0983368Z T=1, 2025-05-07T20:33:44.0983462Z D=7168, 2025-05-07T20:33:44.0983561Z scale_ub=None, 2025-05-07T20:33:44.0983664Z contiguous=False, 2025-05-07T20:33:44.0983766Z compiled=True, 2025-05-07T20:33:44.0983855Z ) 2025-05-07T20:33:44.0984106Z self = 2025-05-07T20:33:44.0984299Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:44.0984304Z 2025-05-07T20:33:44.0984397Z @given( 2025-05-07T20:33:44.0984542Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.0984662Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.0984798Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.0984941Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.0985076Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.0985239Z ) 2025-05-07T20:33:44.0985575Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.0985689Z def test_silu_mul_quant( 2025-05-07T20:33:44.0985780Z self, 2025-05-07T20:33:44.0985873Z T: int, 2025-05-07T20:33:44.0985962Z D: int, 2025-05-07T20:33:44.0986084Z scale_ub: Optional[float], 2025-05-07T20:33:44.0986189Z contiguous: bool, 2025-05-07T20:33:44.0986289Z compiled: bool, 2025-05-07T20:33:44.0986384Z ) -> None: 2025-05-07T20:33:44.0986495Z torch.manual_seed(2025) 2025-05-07T20:33:44.0986581Z 2025-05-07T20:33:44.0986783Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.0986875Z 2025-05-07T20:33:44.0986984Z x_sign = torch.sign(x) 2025-05-07T20:33:44.0987135Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.0987238Z x = x_sign * x_clamp 2025-05-07T20:33:44.0987336Z x0 = x[:, :D] 2025-05-07T20:33:44.0987435Z x1 = x[:, D:] 2025-05-07T20:33:44.0987522Z 2025-05-07T20:33:44.0987621Z if contiguous: 2025-05-07T20:33:44.0987732Z x0 = x0.contiguous() 2025-05-07T20:33:44.0987836Z x1 = x1.contiguous() 2025-05-07T20:33:44.0987926Z 2025-05-07T20:33:44.0988033Z if scale_ub is not None: 2025-05-07T20:33:44.0988157Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.0988318Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.0988409Z ) 2025-05-07T20:33:44.0988498Z else: 2025-05-07T20:33:44.0988618Z scale_ub_tensor = None 2025-05-07T20:33:44.0988703Z 2025-05-07T20:33:44.0988857Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0988967Z op = silu_mul_quant 2025-05-07T20:33:44.0989066Z if compiled: 2025-05-07T20:33:44.0989185Z op = torch.compile(op) 2025-05-07T20:33:44.0989314Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.0989403Z 2025-05-07T20:33:44.0989516Z y_fp8, y_scale = fn() 2025-05-07T20:33:44.0989657Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:44.0989742Z 2025-05-07T20:33:44.0989905Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.0990026Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:44.0990144Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:44.0990291Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:44.0990454Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0990594Z 2025-05-07T20:33:44.0990757Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:44.0990762Z 2025-05-07T20:33:44.0990878Z moe/activation_test.py:126: 2025-05-07T20:33:44.0991031Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0991158Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:44.0991329Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:44.0991985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:44.0992103Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:44.0992513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.0992774Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.0993191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:44.0993491Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0993996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:44.0994377Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:44.0994809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:44.0995001Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:44.0995393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:44.0995484Z fn() 2025-05-07T20:33:44.0995940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:44.0996043Z self.fn.run( 2025-05-07T20:33:44.0996430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.0996538Z kernel = self.compile( 2025-05-07T20:33:44.0996975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.0997183Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.0997335Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.0997340Z 2025-05-07T20:33:44.0997576Z self = 2025-05-07T20:33:44.0998452Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.0999037Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fae2837eef0>} 2025-05-07T20:33:44.0999879Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1000109Z context = 2025-05-07T20:33:44.1000114Z 2025-05-07T20:33:44.1000306Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1000611Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1000739Z module_map=module_map) 2025-05-07T20:33:44.1000925Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1001047Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:44.1001187Z E ^ 2025-05-07T20:33:44.1001633Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1001638Z 2025-05-07T20:33:44.1002111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
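The reference path fails in the same way as the fused kernel, but through one extra layer: triton_quantize_fp8_row (fp8_gemm.py:2370) launches _kernel_quantize_fp8_row, which is wrapped in Triton's Autotuner. On first launch the autotuner benchmarks every candidate config (autotuner.py:186, timings = {config: self._bench(...) for config in pruned_configs}), and each benchmark compiles the kernel, so the fp8e4nv error surfaces from inside do_bench rather than directly from jit.py. A toy sketch of that wrapping, with an illustrative kernel and configs that are not FBGEMM's:

    # Toy autotuned Triton kernel (illustrative, not FBGEMM's): every
    # triton.Config below is compiled and timed at the first launch, so a
    # CompilationError aborts the benchmarking loop exactly as seen above.
    import torch
    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[
            triton.Config({"BLOCK": 256}, num_warps=4),
            triton.Config({"BLOCK": 512}, num_warps=8),
        ],
        key=["n"],
    )
    @triton.jit
    def _toy_scale(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        # One program handles BLOCK contiguous elements.
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask) * 2.0, mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty_like(x)
    grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK"]),)
    _toy_scale[grid](x, y, x.numel())  # autotuning (and JIT compile) happen here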
Each remaining Hypothesis example fails on the identical test body with an equivalent traceback (compiled=True adds the torch/_dynamo/eval_frame.py:678 frame; compiled=False calls silu_mul_quant directly), ending in the same CompilationError from _fbgemm_silu_mul_quant at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80. Only the drawn parameters differ:
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError: type fp8e4nv not supported in this architecture
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError: type fp8e4nv not supported in this architecture
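All of these failures share one root cause: the kernels materialize values as Triton's fp8e4nv element type (FP8 E4M3, torch.float8_e4m3fn on the PyTorch side), and Triton only supports that dtype on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). The error text itself shows this job's GPU is an older part: fp8e4b15 and fp8e5 are the only fp8 dtypes Triton exposes on SM 8.0/8.6 devices such as A100 or A10G. A minimal guard one could put in front of such tests, offered as an assumption about a possible fix rather than FBGEMM's actual test code:

    # Hedged sketch (assumption, not FBGEMM's code): skip FP8-E4M3 tests on
    # GPUs older than SM 8.9, where Triton cannot compile fp8e4nv kernels.
    import unittest
    import torch

    def supports_fp8_e4m3() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) needs compute capability >= (8, 9).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class Fp8TestBase(unittest.TestCase):
        def setUp(self) -> None:
            if not supports_fp8_e4m3():
                self.skipTest("Triton fp8e4nv requires SM 8.9+ (Ada/Hopper)")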
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError: type fp8e4nv not supported in this architecture
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError: type fp8e4nv not supported in this architecture
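For readers skimming past the repeated listings: the test builds x of shape [T, 2*D] in bfloat16, clamps its magnitude into [0.01, 2.0], and splits it into halves x0 and x1. ref_fn then computes y = SiLU(x0) * x1 = x0 * sigmoid(x0) * x1 in fp32 and row-wise quantizes it to FP8, returning the FP8 payload plus one fp32 scale per row, so a row dequantizes as y_fp8.to(torch.float32) * y_scale[:, None]. A PyTorch-only sketch of that math, assuming torch.float8_e4m3fn is available; the names are illustrative and the exact clamping order inside triton_quantize_fp8_row is not asserted here:

    import torch

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # SiLU(x0) * x1 in fp32, matching ref_fn in the listing above.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)   # per-row absmax
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)   # optional upper bound
        fp8_max = torch.finfo(torch.float8_e4m3fn).max   # 448.0 for E4M3
        y_scale = row_max / fp8_max                      # per-row dequant scale
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale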
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError: type fp8e4nv not supported in this architecture
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError: type fp8e4nv not supported in this architecture
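Two traceback shapes alternate through this log. With compiled=True the call chain passes through torch/_dynamo/eval_frame.py:678 before reaching activation.py:80; with compiled=False the test calls silu_mul_quant directly. Either way the launch _fbgemm_silu_mul_quant[grid](...) is what triggers Triton compilation, and ast_to_ttir rejects the kernel before any GPU code runs, so torch.compile neither causes nor avoids the failure. A minimal illustration of the two call paths (a toy op, not the FBGEMM one):

    import torch

    def silu_mul(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        return x0 * torch.sigmoid(x0) * x1

    eager = silu_mul
    compiled = torch.compile(silu_mul)  # adds the eval_frame.py wrapper frame

    x0, x1 = torch.randn(4, 8), torch.randn(4, 8)
    assert torch.allclose(eager(x0, x1), compiled(x0, x1), atol=1e-6)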
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1096197Z 2025-05-07T20:33:44.1096664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1096672Z 2025-05-07T20:33:44.1096798Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1097054Z self=, 2025-05-07T20:33:44.1097145Z T=16384, 2025-05-07T20:33:44.1097244Z D=7168, 2025-05-07T20:33:44.1097340Z scale_ub=None, 2025-05-07T20:33:44.1097441Z contiguous=True, 2025-05-07T20:33:44.1097542Z compiled=True, 2025-05-07T20:33:44.1097631Z ) 2025-05-07T20:33:44.1097883Z self = 2025-05-07T20:33:44.1098083Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:44.1098088Z 2025-05-07T20:33:44.1098180Z @given( 2025-05-07T20:33:44.1098320Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1098437Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1098574Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1098720Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1098903Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1099032Z ) 2025-05-07T20:33:44.1099319Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1099428Z def test_silu_mul_quant( 2025-05-07T20:33:44.1099521Z self, 2025-05-07T20:33:44.1099616Z T: int, 2025-05-07T20:33:44.1099709Z D: int, 2025-05-07T20:33:44.1099828Z scale_ub: Optional[float], 2025-05-07T20:33:44.1099933Z contiguous: bool, 2025-05-07T20:33:44.1100033Z compiled: bool, 2025-05-07T20:33:44.1100128Z ) -> None: 2025-05-07T20:33:44.1100238Z torch.manual_seed(2025) 2025-05-07T20:33:44.1100323Z 2025-05-07T20:33:44.1100521Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1100607Z 2025-05-07T20:33:44.1100714Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1100862Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1100968Z x = x_sign * x_clamp 2025-05-07T20:33:44.1101064Z x0 = x[:, :D] 2025-05-07T20:33:44.1101163Z x1 = x[:, D:] 2025-05-07T20:33:44.1101250Z 2025-05-07T20:33:44.1101359Z if contiguous: 2025-05-07T20:33:44.1101481Z x0 = x0.contiguous() 2025-05-07T20:33:44.1101659Z x1 = x1.contiguous() 2025-05-07T20:33:44.1101748Z 2025-05-07T20:33:44.1101897Z if scale_ub is not None: 2025-05-07T20:33:44.1102023Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1102185Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1102274Z ) 2025-05-07T20:33:44.1102362Z else: 2025-05-07T20:33:44.1102475Z scale_ub_tensor = None 2025-05-07T20:33:44.1102560Z 2025-05-07T20:33:44.1102709Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1102816Z op = silu_mul_quant 2025-05-07T20:33:44.1102916Z if compiled: 2025-05-07T20:33:44.1103039Z op = torch.compile(op) 2025-05-07T20:33:44.1103164Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1103252Z 2025-05-07T20:33:44.1103362Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1103367Z 2025-05-07T20:33:44.1103480Z moe/activation_test.py:117: 2025-05-07T20:33:44.1103631Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1103757Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1103874Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1104294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:44.1104404Z return fn(*args, **kwargs) 
2025-05-07T20:33:44.1104965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.1105081Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.1105492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.1105752Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.1106144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.1106256Z kernel = self.compile( 2025-05-07T20:33:44.1106699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.1106900Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.1107048Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1107052Z 2025-05-07T20:33:44.1107291Z self = 2025-05-07T20:33:44.1108240Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.1108861Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad07513760>} 2025-05-07T20:33:44.1109705Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1109928Z context = 2025-05-07T20:33:44.1109938Z 2025-05-07T20:33:44.1110129Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1110428Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1110555Z module_map=module_map) 2025-05-07T20:33:44.1110746Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1110866Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.1110961Z E ^ 2025-05-07T20:33:44.1111366Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:33:44.1111929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:44.1111935Z 
2025-05-07T20:33:44.1112056Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:44.1112310Z     self=,
2025-05-07T20:33:44.1112407Z     T=4096,
2025-05-07T20:33:44.1112497Z     D=5120,
2025-05-07T20:33:44.1112593Z     scale_ub=None,
2025-05-07T20:33:44.1112698Z     contiguous=False,
2025-05-07T20:33:44.1112796Z     compiled=True,
2025-05-07T20:33:44.1112883Z )
2025-05-07T20:33:44.1113138Z self = 
2025-05-07T20:33:44.1113340Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:33:44.1113345Z 
2025-05-07T20:33:44.1113437Z     @given(
2025-05-07T20:33:44.1113621Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:44.1113740Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:44.1113879Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:44.1114016Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:44.1114149Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:44.1114242Z     )
2025-05-07T20:33:44.1114527Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:44.1114635Z     def test_silu_mul_quant(
2025-05-07T20:33:44.1114730Z         self,
2025-05-07T20:33:44.1114820Z         T: int,
2025-05-07T20:33:44.1114912Z         D: int,
2025-05-07T20:33:44.1115025Z         scale_ub: Optional[float],
2025-05-07T20:33:44.1115133Z         contiguous: bool,
2025-05-07T20:33:44.1115239Z         compiled: bool,
2025-05-07T20:33:44.1115330Z     ) -> None:
2025-05-07T20:33:44.1115440Z         torch.manual_seed(2025)
2025-05-07T20:33:44.1115528Z 
2025-05-07T20:33:44.1115721Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1115811Z 
2025-05-07T20:33:44.1115926Z         x_sign = torch.sign(x)
2025-05-07T20:33:44.1116070Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:44.1116173Z         x = x_sign * x_clamp
2025-05-07T20:33:44.1116269Z         x0 = x[:, :D]
2025-05-07T20:33:44.1116364Z         x1 = x[:, D:]
2025-05-07T20:33:44.1116454Z 
2025-05-07T20:33:44.1116553Z         if contiguous:
2025-05-07T20:33:44.1116661Z             x0 = x0.contiguous()
2025-05-07T20:33:44.1116769Z             x1 = x1.contiguous()
2025-05-07T20:33:44.1116853Z 
2025-05-07T20:33:44.1116960Z         if scale_ub is not None:
2025-05-07T20:33:44.1117140Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:44.1117337Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:44.1117428Z             )
2025-05-07T20:33:44.1117521Z         else:
2025-05-07T20:33:44.1117630Z             scale_ub_tensor = None
2025-05-07T20:33:44.1117717Z 
2025-05-07T20:33:44.1117873Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:44.1117981Z             op = silu_mul_quant
2025-05-07T20:33:44.1118081Z             if compiled:
2025-05-07T20:33:44.1118204Z                 op = torch.compile(op)
2025-05-07T20:33:44.1118326Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:44.1118413Z 
2025-05-07T20:33:44.1118518Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:44.1118523Z 
2025-05-07T20:33:44.1118635Z moe/activation_test.py:117: 
2025-05-07T20:33:44.1118789Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:44.1118906Z moe/activation_test.py:115: in fn
2025-05-07T20:33:44.1119024Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:44.1119449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:44.1119558Z     return fn(*args, **kwargs)
2025-05-07T20:33:44.1120169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:44.1120326Z     _fbgemm_silu_mul_quant[grid](
[... Triton compile stack identical to the trace above: jit.py:330 <lambda> -> jit.py:623 run -> compiler.py:273 compile -> make_ir ...]
2025-05-07T20:33:44.1126247Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:44.1126367Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:44.1126458Z E       ^
2025-05-07T20:33:44.1126860Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:44.1127340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
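Stripped of the Hypothesis harness, the failing call reduces to a few lines. A hypothetical standalone repro (assuming a CUDA build of fbgemm_gpu with the experimental gen_ai extras installed), handy for verifying a fix on sm_89+ hardware:

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

# Mirrors the failing example above: T=4096, D=5120, scale_ub=None.
T, D = 4096, 5120
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :D], x[:, D:]

# On sm_86 (A10G) this raises triton.compiler.errors.CompilationError because
# the kernel emits fp8e4nv; on sm_89+ it returns the FP8 tensor and its scale.
y_fp8, y_scale = silu_mul_quant(x0, x1, None)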
The remaining Hypothesis examples fail identically: every sampled parameter combination reaches the same _fbgemm_silu_mul_quant launch and aborts with the same CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at triton/compiler/compiler.py:100. Only the sampled arguments differ (the duplicated traceback and test source are elided for each):

2025-05-07T20:33:44.1127629Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:44.1142762Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:44.1158101Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:44.1175992Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:44.1190784Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:44.1206882Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:44.1221890Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:44.1237856Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:44.1253532Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:44.1268845Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:44.1284177Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1298808Z 2025-05-07T20:33:44.1299279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1299283Z 2025-05-07T20:33:44.1299412Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1299670Z self=, 2025-05-07T20:33:44.1299760Z T=2048, 2025-05-07T20:33:44.1299854Z D=5120, 2025-05-07T20:33:44.1299952Z scale_ub=None, 2025-05-07T20:33:44.1300053Z contiguous=False, 2025-05-07T20:33:44.1300156Z compiled=True, 2025-05-07T20:33:44.1300242Z ) 2025-05-07T20:33:44.1300495Z self = 2025-05-07T20:33:44.1300702Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:44.1300706Z 2025-05-07T20:33:44.1300799Z @given( 2025-05-07T20:33:44.1300938Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1301053Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1301187Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1301326Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1301508Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1301595Z ) 2025-05-07T20:33:44.1301927Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1302038Z def test_silu_mul_quant( 2025-05-07T20:33:44.1302129Z self, 2025-05-07T20:33:44.1302225Z T: int, 2025-05-07T20:33:44.1302315Z D: int, 2025-05-07T20:33:44.1302436Z scale_ub: Optional[float], 2025-05-07T20:33:44.1302542Z contiguous: bool, 2025-05-07T20:33:44.1302643Z compiled: bool, 2025-05-07T20:33:44.1302737Z ) -> None: 2025-05-07T20:33:44.1302847Z torch.manual_seed(2025) 2025-05-07T20:33:44.1302933Z 2025-05-07T20:33:44.1303130Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1303217Z 2025-05-07T20:33:44.1303323Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1303471Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1303575Z x = x_sign * x_clamp 2025-05-07T20:33:44.1303674Z x0 = x[:, :D] 2025-05-07T20:33:44.1303778Z x1 = x[:, D:] 2025-05-07T20:33:44.1303862Z 2025-05-07T20:33:44.1303961Z if contiguous: 2025-05-07T20:33:44.1304071Z x0 = x0.contiguous() 2025-05-07T20:33:44.1304176Z x1 = x1.contiguous() 2025-05-07T20:33:44.1304311Z 2025-05-07T20:33:44.1304468Z if scale_ub is not None: 2025-05-07T20:33:44.1304592Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1304754Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1304842Z ) 2025-05-07T20:33:44.1304932Z else: 2025-05-07T20:33:44.1305044Z scale_ub_tensor = None 2025-05-07T20:33:44.1305134Z 2025-05-07T20:33:44.1305285Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1305394Z op = silu_mul_quant 2025-05-07T20:33:44.1305494Z if compiled: 2025-05-07T20:33:44.1305610Z op = torch.compile(op) 2025-05-07T20:33:44.1305740Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1305829Z 2025-05-07T20:33:44.1305940Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1305945Z 2025-05-07T20:33:44.1306059Z moe/activation_test.py:117: 2025-05-07T20:33:44.1306207Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1306334Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1306451Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1306871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:44.1306984Z return fn(*args, **kwargs) 
2025-05-07T20:33:44.1307547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.1307664Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.1308075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.1308337Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.1308732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.1308845Z kernel = self.compile( 2025-05-07T20:33:44.1309286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.1309497Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.1309644Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1309649Z 2025-05-07T20:33:44.1309887Z self = 2025-05-07T20:33:44.1310822Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.1311493Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad074543a0>} 2025-05-07T20:33:44.1312348Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1312570Z context = 2025-05-07T20:33:44.1312576Z 2025-05-07T20:33:44.1312771Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1313074Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1316241Z module_map=module_map) 2025-05-07T20:33:44.1316467Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1316597Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.1316689Z E ^ 2025-05-07T20:33:44.1317174Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1317254Z 2025-05-07T20:33:44.1317788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1317793Z 2025-05-07T20:33:44.1317918Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1318180Z self=, 2025-05-07T20:33:44.1318271Z T=2048, 2025-05-07T20:33:44.1318363Z D=5120, 2025-05-07T20:33:44.1318464Z scale_ub=1200.0, 2025-05-07T20:33:44.1318566Z contiguous=False, 2025-05-07T20:33:44.1318664Z compiled=True, 2025-05-07T20:33:44.1318755Z ) 2025-05-07T20:33:44.1319006Z self = 2025-05-07T20:33:44.1319214Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:44.1319223Z 2025-05-07T20:33:44.1319315Z @given( 2025-05-07T20:33:44.1319454Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1319579Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1319722Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1319859Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1319997Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1320085Z ) 2025-05-07T20:33:44.1320369Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1320483Z def test_silu_mul_quant( 2025-05-07T20:33:44.1320574Z self, 2025-05-07T20:33:44.1320664Z T: int, 2025-05-07T20:33:44.1320758Z D: int, 2025-05-07T20:33:44.1320875Z scale_ub: Optional[float], 2025-05-07T20:33:44.1320986Z contiguous: bool, 2025-05-07T20:33:44.1321087Z compiled: bool, 2025-05-07T20:33:44.1321183Z ) -> None: 2025-05-07T20:33:44.1321300Z torch.manual_seed(2025) 2025-05-07T20:33:44.1321386Z 2025-05-07T20:33:44.1321581Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1321674Z 2025-05-07T20:33:44.1321790Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1321939Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1322048Z x = x_sign * x_clamp 2025-05-07T20:33:44.1322142Z x0 = x[:, :D] 2025-05-07T20:33:44.1322236Z x1 = x[:, D:] 2025-05-07T20:33:44.1322326Z 2025-05-07T20:33:44.1322424Z if contiguous: 2025-05-07T20:33:44.1322538Z x0 = x0.contiguous() 2025-05-07T20:33:44.1322642Z x1 = x1.contiguous() 2025-05-07T20:33:44.1322731Z 2025-05-07T20:33:44.1322844Z if scale_ub is not None: 2025-05-07T20:33:44.1322968Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1323221Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1323316Z ) 2025-05-07T20:33:44.1323406Z else: 2025-05-07T20:33:44.1323518Z scale_ub_tensor = None 2025-05-07T20:33:44.1323608Z 2025-05-07T20:33:44.1323764Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1324135Z op = silu_mul_quant 2025-05-07T20:33:44.1324273Z if compiled: 2025-05-07T20:33:44.1324392Z op = torch.compile(op) 2025-05-07T20:33:44.1324520Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1324606Z 2025-05-07T20:33:44.1324714Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1324720Z 2025-05-07T20:33:44.1324837Z moe/activation_test.py:117: 2025-05-07T20:33:44.1324988Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1325106Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1325230Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1325655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:44.1325765Z return fn(*args, **kwargs) 
2025-05-07T20:33:44.1326330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.1326643Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.1327059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.1327317Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.1327707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.1327819Z kernel = self.compile( 2025-05-07T20:33:44.1328256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.1328469Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.1328618Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1328623Z 2025-05-07T20:33:44.1328860Z self = 2025-05-07T20:33:44.1329758Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.1330333Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad07454820>} 2025-05-07T20:33:44.1331182Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1331414Z context = 2025-05-07T20:33:44.1331419Z 2025-05-07T20:33:44.1331610Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1331921Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1332047Z module_map=module_map) 2025-05-07T20:33:44.1332240Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1332355Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.1332446Z E ^ 2025-05-07T20:33:44.1332855Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1332860Z 2025-05-07T20:33:44.1333332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1333404Z 2025-05-07T20:33:44.1333529Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1333851Z self=, 2025-05-07T20:33:44.1333943Z T=4096, 2025-05-07T20:33:44.1334034Z D=5120, 2025-05-07T20:33:44.1334139Z scale_ub=1200.0, 2025-05-07T20:33:44.1334244Z contiguous=True, 2025-05-07T20:33:44.1334344Z compiled=True, 2025-05-07T20:33:44.1334433Z ) 2025-05-07T20:33:44.1334686Z self = 2025-05-07T20:33:44.1334886Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:44.1334891Z 2025-05-07T20:33:44.1334985Z @given( 2025-05-07T20:33:44.1335125Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1335247Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1335380Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1335521Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1335663Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1335751Z ) 2025-05-07T20:33:44.1336034Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1336145Z def test_silu_mul_quant( 2025-05-07T20:33:44.1336286Z self, 2025-05-07T20:33:44.1336377Z T: int, 2025-05-07T20:33:44.1336517Z D: int, 2025-05-07T20:33:44.1336635Z scale_ub: Optional[float], 2025-05-07T20:33:44.1336740Z contiguous: bool, 2025-05-07T20:33:44.1336843Z compiled: bool, 2025-05-07T20:33:44.1336936Z ) -> None: 2025-05-07T20:33:44.1337051Z torch.manual_seed(2025) 2025-05-07T20:33:44.1337137Z 2025-05-07T20:33:44.1337333Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1337422Z 2025-05-07T20:33:44.1337530Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1337675Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1337785Z x = x_sign * x_clamp 2025-05-07T20:33:44.1337882Z x0 = x[:, :D] 2025-05-07T20:33:44.1337976Z x1 = x[:, D:] 2025-05-07T20:33:44.1338064Z 2025-05-07T20:33:44.1338162Z if contiguous: 2025-05-07T20:33:44.1338270Z x0 = x0.contiguous() 2025-05-07T20:33:44.1338384Z x1 = x1.contiguous() 2025-05-07T20:33:44.1338473Z 2025-05-07T20:33:44.1338580Z if scale_ub is not None: 2025-05-07T20:33:44.1338706Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1338861Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1338955Z ) 2025-05-07T20:33:44.1339046Z else: 2025-05-07T20:33:44.1339157Z scale_ub_tensor = None 2025-05-07T20:33:44.1339245Z 2025-05-07T20:33:44.1339395Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1339500Z op = silu_mul_quant 2025-05-07T20:33:44.1339608Z if compiled: 2025-05-07T20:33:44.1339726Z op = torch.compile(op) 2025-05-07T20:33:44.1339853Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1339941Z 2025-05-07T20:33:44.1340047Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1340052Z 2025-05-07T20:33:44.1340170Z moe/activation_test.py:117: 2025-05-07T20:33:44.1340326Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1340445Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1340565Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1340988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:44.1341099Z return fn(*args, **kwargs) 
2025-05-07T20:33:44.1341689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.1341805Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.1342307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.1342566Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.1342956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.1343073Z kernel = self.compile( 2025-05-07T20:33:44.1343508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.1343708Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.1343859Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1343864Z 2025-05-07T20:33:44.1344098Z self = 2025-05-07T20:33:44.1344982Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.1345556Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad07455360>} 2025-05-07T20:33:44.1346483Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1346705Z context = 2025-05-07T20:33:44.1346710Z 2025-05-07T20:33:44.1346901Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1347206Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1347330Z module_map=module_map) 2025-05-07T20:33:44.1347526Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1347643Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.1347733Z E ^ 2025-05-07T20:33:44.1348138Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1348147Z 2025-05-07T20:33:44.1348617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1348622Z 2025-05-07T20:33:44.1348743Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1349001Z self=, 2025-05-07T20:33:44.1349092Z T=128, 2025-05-07T20:33:44.1349184Z D=5120, 2025-05-07T20:33:44.1349281Z scale_ub=1200.0, 2025-05-07T20:33:44.1349382Z contiguous=False, 2025-05-07T20:33:44.1349481Z compiled=True, 2025-05-07T20:33:44.1349570Z ) 2025-05-07T20:33:44.1349819Z self = 2025-05-07T20:33:44.1350022Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:44.1350027Z 2025-05-07T20:33:44.1350117Z @given( 2025-05-07T20:33:44.1350253Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1350377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1350509Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1350647Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1350779Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1350865Z ) 2025-05-07T20:33:44.1351150Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1351258Z def test_silu_mul_quant( 2025-05-07T20:33:44.1351346Z self, 2025-05-07T20:33:44.1351439Z T: int, 2025-05-07T20:33:44.1351529Z D: int, 2025-05-07T20:33:44.1351710Z scale_ub: Optional[float], 2025-05-07T20:33:44.1351829Z contiguous: bool, 2025-05-07T20:33:44.1351985Z compiled: bool, 2025-05-07T20:33:44.1352091Z ) -> None: 2025-05-07T20:33:44.1352231Z torch.manual_seed(2025) 2025-05-07T20:33:44.1352318Z 2025-05-07T20:33:44.1352521Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1352613Z 2025-05-07T20:33:44.1352723Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1352871Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1352976Z x = x_sign * x_clamp 2025-05-07T20:33:44.1353069Z x0 = x[:, :D] 2025-05-07T20:33:44.1353168Z x1 = x[:, D:] 2025-05-07T20:33:44.1353254Z 2025-05-07T20:33:44.1353352Z if contiguous: 2025-05-07T20:33:44.1353462Z x0 = x0.contiguous() 2025-05-07T20:33:44.1353624Z x1 = x1.contiguous() 2025-05-07T20:33:44.1353710Z 2025-05-07T20:33:44.1353819Z if scale_ub is not None: 2025-05-07T20:33:44.1353945Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1354109Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1354198Z ) 2025-05-07T20:33:44.1354288Z else: 2025-05-07T20:33:44.1354403Z scale_ub_tensor = None 2025-05-07T20:33:44.1354542Z 2025-05-07T20:33:44.1354734Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1354846Z op = silu_mul_quant 2025-05-07T20:33:44.1354945Z if compiled: 2025-05-07T20:33:44.1355064Z op = torch.compile(op) 2025-05-07T20:33:44.1355191Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1355277Z 2025-05-07T20:33:44.1355384Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1355389Z 2025-05-07T20:33:44.1355504Z moe/activation_test.py:117: 2025-05-07T20:33:44.1355653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1355777Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1355897Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1356317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:44.1356429Z return fn(*args, **kwargs) 
2025-05-07T20:33:44.1357005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.1357119Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.1357535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.1357793Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.1358187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.1358297Z kernel = self.compile( 2025-05-07T20:33:44.1358741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.1358949Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.1359096Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1359105Z 2025-05-07T20:33:44.1359347Z self = 2025-05-07T20:33:44.1360236Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.1360813Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad07456290>} 2025-05-07T20:33:44.1361708Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1361997Z context = 2025-05-07T20:33:44.1362001Z 2025-05-07T20:33:44.1362196Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1362505Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1362630Z module_map=module_map) 2025-05-07T20:33:44.1362821Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1362937Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.1363030Z E ^ 2025-05-07T20:33:44.1363434Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1363439Z 2025-05-07T20:33:44.1363908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1363918Z 2025-05-07T20:33:44.1364043Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1364303Z self=, 2025-05-07T20:33:44.1364394Z T=16384, 2025-05-07T20:33:44.1364535Z D=7168, 2025-05-07T20:33:44.1364636Z scale_ub=1200.0, 2025-05-07T20:33:44.1364781Z contiguous=True, 2025-05-07T20:33:44.1364882Z compiled=True, 2025-05-07T20:33:44.1364970Z ) 2025-05-07T20:33:44.1365225Z self = 2025-05-07T20:33:44.1365426Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:44.1365431Z 2025-05-07T20:33:44.1365522Z @given( 2025-05-07T20:33:44.1365663Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1365779Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1365913Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1366059Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1366191Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1366282Z ) 2025-05-07T20:33:44.1366566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1366680Z def test_silu_mul_quant( 2025-05-07T20:33:44.1366775Z self, 2025-05-07T20:33:44.1366868Z T: int, 2025-05-07T20:33:44.1366958Z D: int, 2025-05-07T20:33:44.1367075Z scale_ub: Optional[float], 2025-05-07T20:33:44.1367181Z contiguous: bool, 2025-05-07T20:33:44.1367281Z compiled: bool, 2025-05-07T20:33:44.1367376Z ) -> None: 2025-05-07T20:33:44.1367486Z torch.manual_seed(2025) 2025-05-07T20:33:44.1367572Z 2025-05-07T20:33:44.1367770Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1367859Z 2025-05-07T20:33:44.1367971Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1368122Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1368229Z x = x_sign * x_clamp 2025-05-07T20:33:44.1368326Z x0 = x[:, :D] 2025-05-07T20:33:44.1368420Z x1 = x[:, D:] 2025-05-07T20:33:44.1368506Z 2025-05-07T20:33:44.1368606Z if contiguous: 2025-05-07T20:33:44.1368716Z x0 = x0.contiguous() 2025-05-07T20:33:44.1368824Z x1 = x1.contiguous() 2025-05-07T20:33:44.1368912Z 2025-05-07T20:33:44.1369018Z if scale_ub is not None: 2025-05-07T20:33:44.1369140Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1369298Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1369387Z ) 2025-05-07T20:33:44.1369478Z else: 2025-05-07T20:33:44.1369594Z scale_ub_tensor = None 2025-05-07T20:33:44.1369680Z 2025-05-07T20:33:44.1369834Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1369992Z op = silu_mul_quant 2025-05-07T20:33:44.1370091Z if compiled: 2025-05-07T20:33:44.1370257Z op = torch.compile(op) 2025-05-07T20:33:44.1370405Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1370513Z 2025-05-07T20:33:44.1370649Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1370659Z 2025-05-07T20:33:44.1370800Z moe/activation_test.py:117: 2025-05-07T20:33:44.1370991Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1371144Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1371289Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1371819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:44.1371954Z return fn(*args, **kwargs) 
2025-05-07T20:33:44.1372656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.1372807Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.1373266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.1373525Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.1374011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.1374123Z kernel = self.compile( 2025-05-07T20:33:44.1374566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.1374768Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.1374915Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1374920Z 2025-05-07T20:33:44.1375160Z self = 2025-05-07T20:33:44.1376044Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.1376632Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad07456d40>} 2025-05-07T20:33:44.1377479Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1377707Z context = 2025-05-07T20:33:44.1377712Z 2025-05-07T20:33:44.1377904Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1378208Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1378344Z module_map=module_map) 2025-05-07T20:33:44.1378531Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1378646Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.1378741Z E ^ 2025-05-07T20:33:44.1379152Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1379160Z 2025-05-07T20:33:44.1379637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1379642Z 2025-05-07T20:33:44.1379762Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1380017Z self=, 2025-05-07T20:33:44.1380113Z T=16384, 2025-05-07T20:33:44.1380206Z D=5120, 2025-05-07T20:33:44.1380305Z scale_ub=1200.0, 2025-05-07T20:33:44.1380410Z contiguous=True, 2025-05-07T20:33:44.1380559Z compiled=False, 2025-05-07T20:33:44.1380649Z ) 2025-05-07T20:33:44.1380944Z self = 2025-05-07T20:33:44.1381150Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:44.1381155Z 2025-05-07T20:33:44.1381252Z @given( 2025-05-07T20:33:44.1381393Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1381525Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1381682Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1381835Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1381967Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1382056Z ) 2025-05-07T20:33:44.1382341Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1382458Z def test_silu_mul_quant( 2025-05-07T20:33:44.1382549Z self, 2025-05-07T20:33:44.1382643Z T: int, 2025-05-07T20:33:44.1382736Z D: int, 2025-05-07T20:33:44.1382857Z scale_ub: Optional[float], 2025-05-07T20:33:44.1382961Z contiguous: bool, 2025-05-07T20:33:44.1383065Z compiled: bool, 2025-05-07T20:33:44.1383156Z ) -> None: 2025-05-07T20:33:44.1383267Z torch.manual_seed(2025) 2025-05-07T20:33:44.1383407Z 2025-05-07T20:33:44.1383642Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1383729Z 2025-05-07T20:33:44.1383841Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1383986Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1384093Z x = x_sign * x_clamp 2025-05-07T20:33:44.1384187Z x0 = x[:, :D] 2025-05-07T20:33:44.1384283Z x1 = x[:, D:] 2025-05-07T20:33:44.1384372Z 2025-05-07T20:33:44.1384470Z if contiguous: 2025-05-07T20:33:44.1384577Z x0 = x0.contiguous() 2025-05-07T20:33:44.1384690Z x1 = x1.contiguous() 2025-05-07T20:33:44.1384779Z 2025-05-07T20:33:44.1384886Z if scale_ub is not None: 2025-05-07T20:33:44.1385014Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1385171Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1385261Z ) 2025-05-07T20:33:44.1385358Z else: 2025-05-07T20:33:44.1385470Z scale_ub_tensor = None 2025-05-07T20:33:44.1385558Z 2025-05-07T20:33:44.1385712Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1385818Z op = silu_mul_quant 2025-05-07T20:33:44.1385920Z if compiled: 2025-05-07T20:33:44.1386038Z op = torch.compile(op) 2025-05-07T20:33:44.1386160Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1386248Z 2025-05-07T20:33:44.1386354Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1386359Z 2025-05-07T20:33:44.1386472Z moe/activation_test.py:117: 2025-05-07T20:33:44.1386624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1386746Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1386862Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1387439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:44.1387557Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.1387975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.1388232Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.1388625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.1388737Z kernel = self.compile( 2025-05-07T20:33:44.1389179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.1389451Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.1389643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1389648Z 2025-05-07T20:33:44.1389886Z self = 2025-05-07T20:33:44.1390776Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.1391350Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad07457ac0>} 2025-05-07T20:33:44.1392198Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1392426Z context = 2025-05-07T20:33:44.1392431Z 2025-05-07T20:33:44.1392623Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1392931Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1393144Z module_map=module_map) 2025-05-07T20:33:44.1393340Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1393456Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.1393594Z E ^ 2025-05-07T20:33:44.1394006Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1394011Z 2025-05-07T20:33:44.1394485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1394489Z 2025-05-07T20:33:44.1394617Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1394877Z self=, 2025-05-07T20:33:44.1394967Z T=1, 2025-05-07T20:33:44.1395062Z D=7168, 2025-05-07T20:33:44.1395160Z scale_ub=1200.0, 2025-05-07T20:33:44.1395261Z contiguous=False, 2025-05-07T20:33:44.1395369Z compiled=False, 2025-05-07T20:33:44.1395456Z ) 2025-05-07T20:33:44.1395710Z self = 2025-05-07T20:33:44.1395910Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:44.1395915Z 2025-05-07T20:33:44.1396007Z @given( 2025-05-07T20:33:44.1396144Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1396264Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1396399Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1396538Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1396677Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1396764Z ) 2025-05-07T20:33:44.1397057Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1397168Z def test_silu_mul_quant( 2025-05-07T20:33:44.1397258Z self, 2025-05-07T20:33:44.1397353Z T: int, 2025-05-07T20:33:44.1397443Z D: int, 2025-05-07T20:33:44.1397561Z scale_ub: Optional[float], 2025-05-07T20:33:44.1397671Z contiguous: bool, 2025-05-07T20:33:44.1397773Z compiled: bool, 2025-05-07T20:33:44.1397869Z ) -> None: 2025-05-07T20:33:44.1397979Z torch.manual_seed(2025) 2025-05-07T20:33:44.1398066Z 2025-05-07T20:33:44.1398267Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1398355Z 2025-05-07T20:33:44.1398463Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1398611Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1398799Z x = x_sign * x_clamp 2025-05-07T20:33:44.1398893Z x0 = x[:, :D] 2025-05-07T20:33:44.1399034Z x1 = x[:, D:] 2025-05-07T20:33:44.1399121Z 2025-05-07T20:33:44.1399219Z if contiguous: 2025-05-07T20:33:44.1399331Z x0 = x0.contiguous() 2025-05-07T20:33:44.1399436Z x1 = x1.contiguous() 2025-05-07T20:33:44.1399523Z 2025-05-07T20:33:44.1399636Z if scale_ub is not None: 2025-05-07T20:33:44.1399760Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1399921Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1400011Z ) 2025-05-07T20:33:44.1400100Z else: 2025-05-07T20:33:44.1400215Z scale_ub_tensor = None 2025-05-07T20:33:44.1400301Z 2025-05-07T20:33:44.1400452Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1400561Z op = silu_mul_quant 2025-05-07T20:33:44.1400661Z if compiled: 2025-05-07T20:33:44.1400780Z op = torch.compile(op) 2025-05-07T20:33:44.1400911Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1400997Z 2025-05-07T20:33:44.1401103Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1401111Z 2025-05-07T20:33:44.1401224Z moe/activation_test.py:117: 2025-05-07T20:33:44.1401373Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1401585Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1401704Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1402274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.1402392Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.1402802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.1403065Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.1403462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.1403574Z kernel = self.compile( 2025-05-07T20:33:44.1404015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.1404224Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.1404371Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1404376Z 2025-05-07T20:33:44.1404615Z self = 2025-05-07T20:33:44.1405497Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.1406079Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06fa8550>} 2025-05-07T20:33:44.1406928Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1407158Z context = 2025-05-07T20:33:44.1407163Z 2025-05-07T20:33:44.1407355Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1407659Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1407790Z module_map=module_map) 2025-05-07T20:33:44.1407978Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1408096Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.1408190Z E ^ 2025-05-07T20:33:44.1408686Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1408692Z 2025-05-07T20:33:44.1409167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1409175Z 2025-05-07T20:33:44.1409298Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1409559Z self=, 2025-05-07T20:33:44.1409654Z T=4096, 2025-05-07T20:33:44.1409744Z D=7168, 2025-05-07T20:33:44.1409843Z scale_ub=1200.0, 2025-05-07T20:33:44.1409948Z contiguous=False, 2025-05-07T20:33:44.1410047Z compiled=True, 2025-05-07T20:33:44.1410136Z ) 2025-05-07T20:33:44.1410387Z self = 2025-05-07T20:33:44.1410589Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:44.1410594Z 2025-05-07T20:33:44.1410691Z @given( 2025-05-07T20:33:44.1410828Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1410948Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1411085Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1411222Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1411431Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1411565Z ) 2025-05-07T20:33:44.1411901Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1412015Z def test_silu_mul_quant( 2025-05-07T20:33:44.1412106Z self, 2025-05-07T20:33:44.1412198Z T: int, 2025-05-07T20:33:44.1412292Z D: int, 2025-05-07T20:33:44.1412407Z scale_ub: Optional[float], 2025-05-07T20:33:44.1412511Z contiguous: bool, 2025-05-07T20:33:44.1412615Z compiled: bool, 2025-05-07T20:33:44.1412707Z ) -> None: 2025-05-07T20:33:44.1412817Z torch.manual_seed(2025) 2025-05-07T20:33:44.1412908Z 2025-05-07T20:33:44.1413108Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1413195Z 2025-05-07T20:33:44.1413305Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1413450Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1413562Z x = x_sign * x_clamp 2025-05-07T20:33:44.1413658Z x0 = x[:, :D] 2025-05-07T20:33:44.1413754Z x1 = x[:, D:] 2025-05-07T20:33:44.1413843Z 2025-05-07T20:33:44.1413941Z if contiguous: 2025-05-07T20:33:44.1414048Z x0 = x0.contiguous() 2025-05-07T20:33:44.1414156Z x1 = x1.contiguous() 2025-05-07T20:33:44.1414242Z 2025-05-07T20:33:44.1414349Z if scale_ub is not None: 2025-05-07T20:33:44.1414477Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1414632Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1414721Z ) 2025-05-07T20:33:44.1414816Z else: 2025-05-07T20:33:44.1414928Z scale_ub_tensor = None 2025-05-07T20:33:44.1415020Z 2025-05-07T20:33:44.1415171Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1415277Z op = silu_mul_quant 2025-05-07T20:33:44.1415379Z if compiled: 2025-05-07T20:33:44.1415498Z op = torch.compile(op) 2025-05-07T20:33:44.1415624Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1415712Z 2025-05-07T20:33:44.1415820Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1415825Z 2025-05-07T20:33:44.1415939Z moe/activation_test.py:117: 2025-05-07T20:33:44.1416091Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1416208Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1416325Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1416746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:44.1416910Z return fn(*args, **kwargs) 
2025-05-07T20:33:44.1417516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.1417633Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.1418045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:44.1418310Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.1418702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.1418814Z kernel = self.compile( 2025-05-07T20:33:44.1419252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.1419457Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.1419605Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1419614Z 2025-05-07T20:33:44.1419853Z self = 2025-05-07T20:33:44.1420853Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.1421502Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06fa8f70>} 2025-05-07T20:33:44.1422342Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1422567Z context = 2025-05-07T20:33:44.1422575Z 2025-05-07T20:33:44.1422767Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1423071Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1423196Z module_map=module_map) 2025-05-07T20:33:44.1423385Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1423506Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.1423596Z E ^ 2025-05-07T20:33:44.1424395Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1424407Z 2025-05-07T20:33:44.1424882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1424887Z 2025-05-07T20:33:44.1425008Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1425265Z self=, 2025-05-07T20:33:44.1425360Z T=128, 2025-05-07T20:33:44.1425453Z D=7168, 2025-05-07T20:33:44.1425554Z scale_ub=1200.0, 2025-05-07T20:33:44.1425655Z contiguous=False, 2025-05-07T20:33:44.1425752Z compiled=True, 2025-05-07T20:33:44.1425841Z ) 2025-05-07T20:33:44.1426090Z self = 2025-05-07T20:33:44.1426297Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:44.1426302Z 2025-05-07T20:33:44.1426392Z @given( 2025-05-07T20:33:44.1426530Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1426648Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1426781Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1426916Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1427053Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1427140Z ) 2025-05-07T20:33:44.1427516Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1427699Z def test_silu_mul_quant( 2025-05-07T20:33:44.1427790Z self, 2025-05-07T20:33:44.1427884Z T: int, 2025-05-07T20:33:44.1427973Z D: int, 2025-05-07T20:33:44.1428087Z scale_ub: Optional[float], 2025-05-07T20:33:44.1428197Z contiguous: bool, 2025-05-07T20:33:44.1428299Z compiled: bool, 2025-05-07T20:33:44.1428390Z ) -> None: 2025-05-07T20:33:44.1428503Z torch.manual_seed(2025) 2025-05-07T20:33:44.1428590Z 2025-05-07T20:33:44.1428785Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1428876Z 2025-05-07T20:33:44.1428984Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1429129Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1429236Z x = x_sign * x_clamp 2025-05-07T20:33:44.1429330Z x0 = x[:, :D] 2025-05-07T20:33:44.1429427Z x1 = x[:, D:] 2025-05-07T20:33:44.1429517Z 2025-05-07T20:33:44.1429613Z if contiguous: 2025-05-07T20:33:44.1429728Z x0 = x0.contiguous() 2025-05-07T20:33:44.1429831Z x1 = x1.contiguous() 2025-05-07T20:33:44.1429918Z 2025-05-07T20:33:44.1430025Z if scale_ub is not None: 2025-05-07T20:33:44.1430222Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1430460Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1430553Z ) 2025-05-07T20:33:44.1430642Z else: 2025-05-07T20:33:44.1430751Z scale_ub_tensor = None 2025-05-07T20:33:44.1430841Z 2025-05-07T20:33:44.1431014Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1431146Z op = silu_mul_quant 2025-05-07T20:33:44.1431245Z if compiled: 2025-05-07T20:33:44.1431360Z op = torch.compile(op) 2025-05-07T20:33:44.1431484Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1431572Z 2025-05-07T20:33:44.1431677Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1431682Z 2025-05-07T20:33:44.1431801Z moe/activation_test.py:117: 2025-05-07T20:33:44.1431949Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1432066Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1432190Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1432608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:44.1432719Z return fn(*args, **kwargs) 
2025-05-07T20:33:44.1433277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:44.1433406Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:44.1434031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:44.1434384Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:44.1434924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:44.1435080Z     kernel = self.compile(
2025-05-07T20:33:44.1435568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:44.1435781Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:44.1435930Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:44.1436171Z self =
2025-05-07T20:33:44.1437050Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:44.1437743Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06fa92d0>}
2025-05-07T20:33:44.1438592Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:44.1438815Z context =
2025-05-07T20:33:44.1439014Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:44.1439318Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:44.1439442Z                            module_map=module_map)
2025-05-07T20:33:44.1439630Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:44.1439746Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:44.1439840Z E   ^
2025-05-07T20:33:44.1440248Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:44.1440721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:44.1440937Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
    [... test body and Triton traceback repeat verbatim for this example ...]
2025-05-07T20:33:44.1455333Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:44.1455805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
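Every example that reaches the Triton kernel launch dies in the same place: the NVIDIA backend rejects the fp8e4nv (FP8 E4M3) type and offers only fp8e4b15 and fp8e5, which indicates this runner's GPU architecture simply lacks that dtype. A minimal sketch of a guard that could skip these cases up front; the 8.9 capability threshold is an assumption inferred from the error text, not something this log confirms:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (E4M3) is only available on compute
        # capability >= 8.9 (Ada/Hopper). The log only shows that the
        # current GPU is NOT supported.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical placement: decorate the FP8 tests so unsupported runners
    # skip once instead of erroring through every Hypothesis example.
    skip_unless_fp8 = unittest.skipUnless(
        supports_fp8e4nv(), "fp8e4nv unsupported on this GPU"
    )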
2025-05-07T20:33:44.1455935Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
    [... test body repeats verbatim; this example fails before reaching the kernel ...]
2025-05-07T20:33:44.1463717Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:44.1465769Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:44.1465919Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:44.1466049Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:44.1470113Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:44.1472175Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (28.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1472322Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:44.1472538Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:44.1476305Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1478318Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (140.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1478463Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:44.1478590Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:44.1482740Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:44.1484750Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1484904Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:44.1485029Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:44.1488938Z >       x_sign = torch.sign(x)
2025-05-07T20:33:44.1490940Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1491095Z moe/activation_test.py:94: OutOfMemoryError
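The requested sizes in these OOMs track the test's own shapes exactly: x is [T, 2*D] in bfloat16 (2 bytes per element), and x_sign / x_clamp each materialize another tensor of the same size, so every failing line asks for T * 2D * 2 bytes. A quick check against the figures reported above:

    # Size of one [T, 2*D] bfloat16 tensor, in MiB.
    def alloc_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / 2**20

    assert alloc_mib(16384, 5120) == 320.0  # "Tried to allocate 320.00 MiB"
    assert alloc_mib(4096, 7168) == 112.0   # "Tried to allocate 112.00 MiB"
    assert alloc_mib(2048, 7168) == 56.0    # "Tried to allocate 56.00 MiB"
    assert alloc_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"

So no single allocation is oversized; the problem is the ~21-22 GiB already held on the device when each example starts.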
2025-05-07T20:33:44.1491220Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
    [... test body runs through y_fp8, y_scale = fn(); same Triton traceback as above ...]
2025-05-07T20:33:44.1505758Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:44.1506243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:44.1506372Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
    [... same traceback ...]
2025-05-07T20:33:44.1520694Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:44.1521168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:44.1521292Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
    [... same traceback ...]
2025-05-07T20:33:44.1535728Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:44.1536289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
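Note that even T=1 fails identically: the error is raised while compiling _fbgemm_silu_mul_quant, before any data-dependent work, so neither the tensor shape nor the compiled flag matters. An untested sketch of about the smallest Triton program that should reproduce the dtype rejection on this class of GPU; the kernel and tensor names are illustrative, not from FBGEMM:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _fp8_probe(x_ptr, y_ptr):
        # The cast to fp8e4nv is what the NVIDIA backend rejects on
        # architectures without E4M3 support.
        x = tl.load(x_ptr)
        tl.store(y_ptr, x.to(tl.float8e4nv))

    x = torch.randn(1, device="cuda")
    y = torch.empty(1, device="cuda", dtype=torch.float8_e4m3fn)
    _fp8_probe[(1,)](x, y)  # expected: the same CompilationError as above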
2025-05-07T20:33:44.1536418Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:44.1540120Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1542119Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (26.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1542270Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:44.1542394Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    [... test body runs through y_fp8, y_scale = fn(); same Triton traceback as above ...]
2025-05-07T20:33:44.1556373Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:44.1556841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:44.1556979Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:44.1560854Z >       x_sign = torch.sign(x)
2025-05-07T20:33:44.1562840Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (26.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1562991Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:33:44.1563116Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:44.1566912Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1568896Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB (26.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1569041Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:44.1569167Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:44.1572896Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1575012Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB (26.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1575166Z moe/activation_test.py:92: OutOfMemoryError
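From here on the allocator reports ~22.04 GiB in use with only ~26 MiB free before each example even allocates its input, so memory is accumulating across Hypothesis examples rather than any single case being too large. The error text itself suggests expandable segments; a sketch of that plus an explicit cache release between examples (the env var must be set before CUDA is first initialized, and release_cuda_memory is a hypothetical helper, not FBGEMM API):

    import os

    # Allocator hint taken from the OOM message; it has to be in the
    # environment before torch initializes CUDA.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch

    def release_cuda_memory() -> None:
        # Return cached blocks to the driver so the next Hypothesis
        # example starts from a clean pool.
        torch.cuda.synchronize()
        torch.cuda.empty_cache()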
2025-05-07T20:33:44.1575291Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:44.1578982Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1580968Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (26.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1581113Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:44.1581245Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:44.1584948Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1586960Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (26.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1587147Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:44.1587270Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:44.1590847Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1592871Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (26.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1593057Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:44.1593187Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:44.1596898Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1598876Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (26.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1599029Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:44.1599154Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:44.1606181Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1608190Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (26.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1608427Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:44.1608554Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:44.1612207Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1614203Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (26.44 MiB free of 22.07 GiB; allocator detail elided).
2025-05-07T20:33:44.1614347Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:44.1614566Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:44.1618238Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:44.1620232Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:44.1620244Z 2025-05-07T20:33:44.1620383Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:44.1620388Z 2025-05-07T20:33:44.1620506Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1620756Z self=, 2025-05-07T20:33:44.1620852Z T=16384, 2025-05-07T20:33:44.1620943Z D=7168, 2025-05-07T20:33:44.1621041Z scale_ub=1200.0, 2025-05-07T20:33:44.1621145Z contiguous=True, 2025-05-07T20:33:44.1621244Z compiled=False, 2025-05-07T20:33:44.1621335Z ) 2025-05-07T20:33:44.1621616Z self = 2025-05-07T20:33:44.1621842Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:44.1621846Z 2025-05-07T20:33:44.1621949Z @given( 2025-05-07T20:33:44.1622098Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1622220Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1622354Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1622491Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1622623Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1622709Z ) 2025-05-07T20:33:44.1622993Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1623104Z def test_silu_mul_quant( 2025-05-07T20:33:44.1623202Z self, 2025-05-07T20:33:44.1623293Z T: int, 2025-05-07T20:33:44.1623383Z D: int, 2025-05-07T20:33:44.1623556Z scale_ub: Optional[float], 2025-05-07T20:33:44.1623704Z contiguous: bool, 2025-05-07T20:33:44.1624147Z compiled: bool, 2025-05-07T20:33:44.1624293Z ) -> None: 2025-05-07T20:33:44.1624413Z torch.manual_seed(2025) 2025-05-07T20:33:44.1624499Z 2025-05-07T20:33:44.1624706Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1626707Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
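The five examples above all abort at that first randn call even though the requested blocks are small relative to the card, while each message reports about 21.7 GiB already held by PyTorch. That pattern suggests tensors from earlier Hypothesis examples are still alive when the next example starts. One common mitigation, sketched here as an assumption rather than anything this suite is shown to do, is to release cached CUDA memory between examples:

    import gc

    import torch

    def free_cuda_between_examples() -> None:
        # Drop dangling Python references that keep CUDA tensors alive,
        # then return the allocator's cached blocks to the driver.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

Calling this at the top of test_silu_mul_quant, or from a pytest fixture, would keep one example's tensors from starving the next. Setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, which the error text itself recommends, targets fragmentation rather than total usage, so it may not help if the memory is genuinely still referenced.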
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:44.1626716Z 2025-05-07T20:33:44.1626857Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:44.1626865Z 2025-05-07T20:33:44.1626984Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1627240Z self=, 2025-05-07T20:33:44.1627436Z T=128, 2025-05-07T20:33:44.1627527Z D=5120, 2025-05-07T20:33:44.1627692Z scale_ub=1200.0, 2025-05-07T20:33:44.1627795Z contiguous=False, 2025-05-07T20:33:44.1627893Z compiled=False, 2025-05-07T20:33:44.1627981Z ) 2025-05-07T20:33:44.1628231Z self = 2025-05-07T20:33:44.1628427Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:44.1628432Z 2025-05-07T20:33:44.1628527Z @given( 2025-05-07T20:33:44.1628661Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1628779Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1628914Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1629053Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1629192Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1629280Z ) 2025-05-07T20:33:44.1629560Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1629678Z def test_silu_mul_quant( 2025-05-07T20:33:44.1629770Z self, 2025-05-07T20:33:44.1629860Z T: int, 2025-05-07T20:33:44.1629954Z D: int, 2025-05-07T20:33:44.1630067Z scale_ub: Optional[float], 2025-05-07T20:33:44.1630171Z contiguous: bool, 2025-05-07T20:33:44.1630275Z compiled: bool, 2025-05-07T20:33:44.1630367Z ) -> None: 2025-05-07T20:33:44.1630482Z torch.manual_seed(2025) 2025-05-07T20:33:44.1630569Z 2025-05-07T20:33:44.1630786Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1630888Z 2025-05-07T20:33:44.1631015Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1631163Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1631271Z x = x_sign * x_clamp 2025-05-07T20:33:44.1631365Z x0 = x[:, :D] 2025-05-07T20:33:44.1631460Z x1 = x[:, D:] 2025-05-07T20:33:44.1631549Z 2025-05-07T20:33:44.1631651Z if contiguous: 2025-05-07T20:33:44.1631762Z x0 = x0.contiguous() 2025-05-07T20:33:44.1631872Z x1 = x1.contiguous() 2025-05-07T20:33:44.1631957Z 2025-05-07T20:33:44.1632067Z if scale_ub is not None: 2025-05-07T20:33:44.1632191Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1632346Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1632441Z ) 2025-05-07T20:33:44.1632531Z else: 2025-05-07T20:33:44.1632641Z scale_ub_tensor = None 2025-05-07T20:33:44.1632730Z 2025-05-07T20:33:44.1632881Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1633062Z op = silu_mul_quant 2025-05-07T20:33:44.1633227Z if compiled: 2025-05-07T20:33:44.1633346Z op = torch.compile(op) 2025-05-07T20:33:44.1633468Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1633612Z 2025-05-07T20:33:44.1633719Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1633727Z 2025-05-07T20:33:44.1633847Z moe/activation_test.py:117: 2025-05-07T20:33:44.1633998Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1634116Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1634235Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1634808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.1634921Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.1635338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:33:44.1635600Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.1635996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.1636106Z kernel = self.compile( 2025-05-07T20:33:44.1636642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.1636848Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.1636996Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1637000Z 2025-05-07T20:33:44.1637241Z self = 2025-05-07T20:33:44.1638125Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.1638706Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06c11ea0>} 2025-05-07T20:33:44.1639555Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1639778Z context = 2025-05-07T20:33:44.1639783Z 2025-05-07T20:33:44.1639975Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1640276Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1640402Z module_map=module_map) 2025-05-07T20:33:44.1640593Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1640712Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.1640806Z E ^ 2025-05-07T20:33:44.1641213Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1641219Z 2025-05-07T20:33:44.1641695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1641700Z 2025-05-07T20:33:44.1641827Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1642081Z self=, 2025-05-07T20:33:44.1642172Z T=2048, 2025-05-07T20:33:44.1642266Z D=7168, 2025-05-07T20:33:44.1642362Z scale_ub=None, 2025-05-07T20:33:44.1642469Z contiguous=False, 2025-05-07T20:33:44.1642567Z compiled=False, 2025-05-07T20:33:44.1642653Z ) 2025-05-07T20:33:44.1642905Z self = 2025-05-07T20:33:44.1643184Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:44.1643232Z 2025-05-07T20:33:44.1643322Z @given( 2025-05-07T20:33:44.1643462Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1643579Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1643716Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1643858Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1643990Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1644079Z ) 2025-05-07T20:33:44.1644360Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1644469Z def test_silu_mul_quant( 2025-05-07T20:33:44.1644561Z self, 2025-05-07T20:33:44.1644652Z T: int, 2025-05-07T20:33:44.1644742Z D: int, 2025-05-07T20:33:44.1644858Z scale_ub: Optional[float], 2025-05-07T20:33:44.1644962Z contiguous: bool, 2025-05-07T20:33:44.1645066Z compiled: bool, 2025-05-07T20:33:44.1645162Z ) -> None: 2025-05-07T20:33:44.1645276Z torch.manual_seed(2025) 2025-05-07T20:33:44.1645361Z 2025-05-07T20:33:44.1645560Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1647594Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
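On the CompilationError above (it recurs below): fp8e4nv is Triton's name for the FP8 E4M3 dtype, and the error shows this GPU's backend offering only fp8e4b15 and fp8e5. A hedged guard, assuming fp8e4nv needs roughly compute capability 8.9 or newer (the helper below is hypothetical; the real suite may gate this differently):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: Triton exposes fp8e4nv only on devices with compute
        # capability >= (8, 9); this log's device reports fp8e4b15/fp8e5 only.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    requires_fp8e4nv = unittest.skipIf(
        not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU"
    )
    # Decorating test_silu_mul_quant with @requires_fp8e4nv would turn the
    # CompilationError above into a skip rather than a failure on this runner.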
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:44.1647643Z 2025-05-07T20:33:44.1647788Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:44.1647795Z 2025-05-07T20:33:44.1647915Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1648175Z self=, 2025-05-07T20:33:44.1648266Z T=128, 2025-05-07T20:33:44.1648355Z D=7168, 2025-05-07T20:33:44.1648457Z scale_ub=1200.0, 2025-05-07T20:33:44.1648559Z contiguous=True, 2025-05-07T20:33:44.1648657Z compiled=True, 2025-05-07T20:33:44.1648751Z ) 2025-05-07T20:33:44.1648997Z self = 2025-05-07T20:33:44.1649189Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:44.1649194Z 2025-05-07T20:33:44.1649289Z @given( 2025-05-07T20:33:44.1649424Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1649543Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1649676Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1649810Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1649950Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1650040Z ) 2025-05-07T20:33:44.1650343Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1650485Z def test_silu_mul_quant( 2025-05-07T20:33:44.1650596Z self, 2025-05-07T20:33:44.1650712Z T: int, 2025-05-07T20:33:44.1650830Z D: int, 2025-05-07T20:33:44.1650973Z scale_ub: Optional[float], 2025-05-07T20:33:44.1651107Z contiguous: bool, 2025-05-07T20:33:44.1651232Z compiled: bool, 2025-05-07T20:33:44.1651346Z ) -> None: 2025-05-07T20:33:44.1651486Z torch.manual_seed(2025) 2025-05-07T20:33:44.1651593Z 2025-05-07T20:33:44.1651831Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1651942Z 2025-05-07T20:33:44.1652075Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1652224Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1652388Z x = x_sign * x_clamp 2025-05-07T20:33:44.1652482Z x0 = x[:, :D] 2025-05-07T20:33:44.1652618Z x1 = x[:, D:] 2025-05-07T20:33:44.1652707Z 2025-05-07T20:33:44.1652805Z if contiguous: 2025-05-07T20:33:44.1652913Z x0 = x0.contiguous() 2025-05-07T20:33:44.1653024Z x1 = x1.contiguous() 2025-05-07T20:33:44.1653108Z 2025-05-07T20:33:44.1653219Z if scale_ub is not None: 2025-05-07T20:33:44.1653340Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:44.1653494Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:44.1653585Z ) 2025-05-07T20:33:44.1653674Z else: 2025-05-07T20:33:44.1653783Z scale_ub_tensor = None 2025-05-07T20:33:44.1653871Z 2025-05-07T20:33:44.1654019Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:44.1654125Z op = silu_mul_quant 2025-05-07T20:33:44.1654226Z if compiled: 2025-05-07T20:33:44.1654345Z op = torch.compile(op) 2025-05-07T20:33:44.1654470Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1654558Z 2025-05-07T20:33:44.1654663Z > y_fp8, y_scale = fn() 2025-05-07T20:33:44.1654668Z 2025-05-07T20:33:44.1654783Z moe/activation_test.py:117: 2025-05-07T20:33:44.1654978Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1655135Z moe/activation_test.py:115: in fn 2025-05-07T20:33:44.1655255Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:44.1655673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:44.1655782Z return fn(*args, **kwargs) 
2025-05-07T20:33:44.1656340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:44.1656454Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:44.1656943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:33:44.1657314Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:44.1657842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:44.1658001Z kernel = self.compile( 2025-05-07T20:33:44.1658492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:44.1658695Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:44.1658847Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:44.1658852Z 2025-05-07T20:33:44.1659084Z self = 2025-05-07T20:33:44.1659959Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:44.1660529Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fad06c137f0>} 2025-05-07T20:33:44.1661388Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:44.1661645Z context = 2025-05-07T20:33:44.1661650Z 2025-05-07T20:33:44.1661843Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:44.1662147Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:44.1662273Z module_map=module_map) 2025-05-07T20:33:44.1662527Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:44.1662688Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:44.1662779Z E ^ 2025-05-07T20:33:44.1663183Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:44.1663191Z 2025-05-07T20:33:44.1663657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:44.1663662Z 2025-05-07T20:33:44.1663788Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1664040Z self=, 2025-05-07T20:33:44.1664131Z T=128, 2025-05-07T20:33:44.1664225Z D=7168, 2025-05-07T20:33:44.1664322Z scale_ub=1200.0, 2025-05-07T20:33:44.1664421Z contiguous=True, 2025-05-07T20:33:44.1664520Z compiled=False, 2025-05-07T20:33:44.1664605Z ) 2025-05-07T20:33:44.1664856Z self = 2025-05-07T20:33:44.1665056Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:44.1665062Z 2025-05-07T20:33:44.1665151Z @given( 2025-05-07T20:33:44.1665285Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1665458Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1665630Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1665770Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1665904Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1665991Z ) 2025-05-07T20:33:44.1666276Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1666384Z def test_silu_mul_quant( 2025-05-07T20:33:44.1666473Z self, 2025-05-07T20:33:44.1666565Z T: int, 2025-05-07T20:33:44.1666654Z D: int, 2025-05-07T20:33:44.1666767Z scale_ub: Optional[float], 2025-05-07T20:33:44.1666876Z contiguous: bool, 2025-05-07T20:33:44.1666980Z compiled: bool, 2025-05-07T20:33:44.1667075Z ) -> None: 2025-05-07T20:33:44.1667186Z torch.manual_seed(2025) 2025-05-07T20:33:44.1667271Z 2025-05-07T20:33:44.1667468Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1667556Z 2025-05-07T20:33:44.1667665Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1667814Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1669788Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:44.1669797Z 2025-05-07T20:33:44.1669938Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:44.1669943Z 2025-05-07T20:33:44.1670062Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1670319Z self=, 2025-05-07T20:33:44.1670414Z T=128, 2025-05-07T20:33:44.1670505Z D=5120, 2025-05-07T20:33:44.1670605Z scale_ub=1200.0, 2025-05-07T20:33:44.1670718Z contiguous=True, 2025-05-07T20:33:44.1670825Z compiled=True, 2025-05-07T20:33:44.1670933Z ) 2025-05-07T20:33:44.1671181Z self = 2025-05-07T20:33:44.1671370Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:44.1671375Z 2025-05-07T20:33:44.1671467Z @given( 2025-05-07T20:33:44.1671601Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1671768Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1671945Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1672080Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1672217Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1672306Z ) 2025-05-07T20:33:44.1672588Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1672699Z def test_silu_mul_quant( 2025-05-07T20:33:44.1672788Z self, 2025-05-07T20:33:44.1672878Z T: int, 2025-05-07T20:33:44.1672968Z D: int, 2025-05-07T20:33:44.1673083Z scale_ub: Optional[float], 2025-05-07T20:33:44.1673186Z contiguous: bool, 2025-05-07T20:33:44.1673287Z compiled: bool, 2025-05-07T20:33:44.1673377Z ) -> None: 2025-05-07T20:33:44.1673487Z torch.manual_seed(2025) 2025-05-07T20:33:44.1673637Z 2025-05-07T20:33:44.1673834Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1673923Z 2025-05-07T20:33:44.1674038Z x_sign = torch.sign(x) 2025-05-07T20:33:44.1674183Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:44.1676215Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
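One more observation, useful for isolating the Triton failure from the activation math: given the op's name and the way the test splits x into x0 and x1 halves, silu_mul_quant plausibly computes SiLU(x0) * x1 and then quantizes the product to FP8. An unquantized eager reference under that assumption (the kernel's actual semantics are not shown in this log):

    import torch
    import torch.nn.functional as F

    def silu_mul_reference(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # SiLU(x0) * x1 in the input dtype; the FP8 quantization step (and
        # the scale tensor the real op appears to return as y_scale) is omitted.
        return F.silu(x0) * x1

Comparing this against the kernel's dequantized output would separate numerical questions from the fp8e4nv compilation issue seen on this device.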
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:44.1676286Z 2025-05-07T20:33:44.1676424Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:44.1676429Z 2025-05-07T20:33:44.1676556Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:44.1676813Z self=, 2025-05-07T20:33:44.1676903Z T=128, 2025-05-07T20:33:44.1676995Z D=7168, 2025-05-07T20:33:44.1677092Z scale_ub=None, 2025-05-07T20:33:44.1677191Z contiguous=True, 2025-05-07T20:33:44.1677296Z compiled=True, 2025-05-07T20:33:44.1677382Z ) 2025-05-07T20:33:44.1677630Z self = 2025-05-07T20:33:44.1677822Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:44.1677827Z 2025-05-07T20:33:44.1677916Z @given( 2025-05-07T20:33:44.1678054Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:44.1678168Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:44.1678299Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:44.1678435Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:44.1678564Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:44.1678653Z ) 2025-05-07T20:33:44.1678939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:44.1679046Z def test_silu_mul_quant( 2025-05-07T20:33:44.1679135Z self, 2025-05-07T20:33:44.1679228Z T: int, 2025-05-07T20:33:44.1679320Z D: int, 2025-05-07T20:33:44.1679439Z scale_ub: Optional[float], 2025-05-07T20:33:44.1679546Z contiguous: bool, 2025-05-07T20:33:44.1679644Z compiled: bool, 2025-05-07T20:33:44.1679737Z ) -> None: 2025-05-07T20:33:44.1679847Z torch.manual_seed(2025) 2025-05-07T20:33:44.1679933Z 2025-05-07T20:33:44.1680128Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:44.1682234Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:44.1682285Z 2025-05-07T20:33:44.1682428Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:44.1682584Z =============================== warnings summary =============================== 2025-05-07T20:33:44.1682935Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:44.1683285Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:44.1683626Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:44.1684626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:44.1684893Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:44.1684944Z 2025-05-07T20:33:44.1685233Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:44.1685428Z ================= 1 failed, 1 deselected, 3 warnings in 18.32s ================= 2025-05-07T20:33:45.7790626Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:45.8414473Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:33:45.8414974Z 2025-05-07T20:33:45.8415327Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:33:45.8416543Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:33:45.8417382Z 2025-05-07T20:33:45.8417391Z 2025-05-07T20:33:45.8417399Z 2025-05-07T20:33:45.8435879Z ##[error]Process completed with exit code 1. 2025-05-07T20:33:45.8516140Z Post job cleanup. 2025-05-07T20:33:45.9520806Z [command]/usr/bin/git version 2025-05-07T20:33:45.9566818Z git version 2.47.1 2025-05-07T20:33:45.9605867Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/b68dad46-55d8-40b5-a0ad-1b7c3566ed55/.gitconfig' 2025-05-07T20:33:45.9617301Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/b68dad46-55d8-40b5-a0ad-1b7c3566ed55' before making global git config changes 2025-05-07T20:33:45.9618203Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:33:45.9622944Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:33:45.9685902Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:33:45.9721244Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:33:46.0059874Z Entering 'external/asmjit' 2025-05-07T20:33:46.0126393Z Entering 'external/composable_kernel' 2025-05-07T20:33:46.0199912Z Entering 'external/cpuinfo' 2025-05-07T20:33:46.0267710Z Entering 'external/cutlass' 2025-05-07T20:33:46.0342560Z Entering 'external/googletest' 2025-05-07T20:33:46.0409502Z Entering 'external/hipify_torch' 2025-05-07T20:33:46.0476022Z Entering 'external/json' 2025-05-07T20:33:46.0563314Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:33:46.0588695Z http.https://github.com/.extraheader 2025-05-07T20:33:46.0601173Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:33:46.0632559Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:33:46.0962955Z Entering 'external/asmjit' 2025-05-07T20:33:46.1006108Z http.https://github.com/.extraheader 2025-05-07T20:33:46.1050104Z Entering 'external/composable_kernel' 2025-05-07T20:33:46.1093104Z http.https://github.com/.extraheader 2025-05-07T20:33:46.1142595Z Entering 'external/cpuinfo' 2025-05-07T20:33:46.1184714Z http.https://github.com/.extraheader 2025-05-07T20:33:46.1228112Z Entering 'external/cutlass' 2025-05-07T20:33:46.1270361Z http.https://github.com/.extraheader 2025-05-07T20:33:46.1321553Z 
Entering 'external/googletest' 2025-05-07T20:33:46.1364211Z http.https://github.com/.extraheader 2025-05-07T20:33:46.1407038Z Entering 'external/hipify_torch' 2025-05-07T20:33:46.1450101Z http.https://github.com/.extraheader 2025-05-07T20:33:46.1492560Z Entering 'external/json' 2025-05-07T20:33:46.1535621Z http.https://github.com/.extraheader 2025-05-07T20:33:46.1686274Z A job completed hook has been configured by the self-hosted runner administrator 2025-05-07T20:33:46.1721409Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-05-07T20:33:46.1732523Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:33:46.1732906Z ##[endgroup] 2025-05-07T20:33:46.1833679Z [!ALERT!] Swap in detected! [!ALERT!] 2025-05-07T20:33:57.3815575Z [!ALERT!] Swap out detected [!ALERT!] 2025-05-07T20:34:14.4723516Z Cleaning up orphan processes
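For a local reproduction of this failure, the pytest invocation logged above can be rerun with the allocator setting the OOM messages recommend. The command and test path below come from this log; the wrapper itself, and whether the setting changes the outcome, are assumptions:

    import os
    import subprocess

    # Allocator hint taken from the OOM error text in this log.
    env = dict(os.environ, PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True")
    subprocess.run(
        ["python", "-m", "pytest", "-v", "-rsx", "-s",
         "-W", "ignore::pytest.PytestCollectionWarning",
         "--lf", "--last-failed-no-failures", "none",
         "./moe/activation_test.py"],
        env=env,
        check=False,  # the logged run exits non-zero; inspect output rather than raise
    )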